AD-A041 246   MASSACHUSETTS INST OF TECH LEXINGTON LINCOLN LAB
SPEECH EVALUATION. (U)
SEP 76   B. GOLD
UNCLASSIFIED   ESD-TR-76-382   F19628-76-C-0002




SPEECH  EVALUATION 


ANNUAL  REPORT  FOR  FY  1976-7T 
TO  THE 

DEFENSE  COMMUNICATIONS  AGENCY 


30  SEPTEMBER  1976 
ISSUED  16  MAY  1977 


Approved  for  public  release;  distribution  unlimited. 




MASSACHUSETTS




ABSTRACT

This volume reports the work performed during FY 76-7T on the DCA
Speech Evaluation Contract. Work during this period on System Implications
of Packetized Speech for DCA is reported under separate cover.

Three general areas of work are reported in this document:

(1)  Work on narrowband terminal "robustness"

(2)  Work on wideband-narrowband tandeming

(3)  Hardware speech-terminal efforts.

The  robustness  issues  are  defined  early  in  this  report;  then,  work  on 
telephone-line  simulation,  robust  pitch  extraction,  and  operation  of  LPC 
vocoders  in  acoustically  noisy  environments  is  reported. 

This  report  also  discusses  some  approaches  and  progress  made  in  the 
improvement  of  wideband  devices,  and  the  interoperability  of  wideband  and 
narrowband  terminals. 

The design and development of a microprocessor-based LPC vocoder,
as well as some work on the development of charge-transfer-device-based
channel-vocoder equipment, also are described.




CONTENTS

Abstract
Acknowledgments

I.  PROGRAM OVERVIEW - FY 1976-7T
    A.  Robust Speech Processing
        1.  Acoustically Coupled Background Noise
        2.  Telephone Input Speech
    B.  Interoperability of Wideband and Narrowband Speech Terminals
    C.  Vocoder Hardware Implementations
    D.  Experiments and Demonstrations

II.  OUTLINE OF THIS ANNUAL REPORT

III.  ROBUST NARROWBAND DIGITIZERS
    A.  Identification of Robustness Issues
        1.  Variation Among Speakers
        2.  Tandeming
        3.  Background Noise
        4.  Telephone Speech
        5.  Channel Errors
    B.  The Speaker Variation Problem
        1.  Volume Control
        2.  Spectrum Equalization
        3.  Adaptive Quantization of the LPC Parameters
        4.  Philosophy of Speaker Adaptation
    C.  Tandeming
        1.  Conferencing of Digitally Processed Speech
        2.  The Disadvantaged User
        3.  Remarks on Real-Time Simulation of Tandeming Situations
    D.  Background Noise
        1.  Summary of T&E Results
        2.  Comments
    E.  Channel Noise
        1.  T&E Results
        2.  Jamming
        3.  Selective Coding of Parameters
        4.  Summary
    F.  Telephone Speech

IV.  THE TELEPHONE-LINE SIMULATOR
    A.  Overview-Description of a Telephone Channel
    B.  Single-Sideband (SSB) Modulation
    C.  Disturbances in the Transmission System
        1.  Quadrature Distortion
        2.  Filtering
        3.  Nonlinear Distortion
        4.  Gaussian and Impulse Noise
        5.  Echo and Crosstalk
    D.  Measuring the Telephone Channel
    E.  Simplifications Used in the Simulation
    F.  Implementation on the LDVT
    G.  DVT Times and Space

V.  THE HARMONIC PITCH DETECTOR
    A.  Introduction
    B.  Preprocessing
    C.  Peak Picking Algorithm
    D.  LDVT Implementation
    E.  Results

VI.  OPTIMUM SPEECH CLASSIFICATION AND ADAPTIVE NOISE CANCELLATION
    A.  Introduction
    B.  Models for Silence, Unvoiced, and Voiced Speech
    C.  The Optimum Classifier Against White Noise
    D.  Practical Implementation of the Estimator-Correlator Speech Classifier
    E.  Pitch Estimation
    F.  The Optimum Classifier Against Colored Noise
    G.  Practical Implementation of the Estimator-Correlator Speech Classifier
    H.  Experimental Results
    I.  Conclusions

VII.  TANDEMING AND IMPROVEMENT OF HIGH-RATE CODERS

VIII.  MICROPROCESSOR REALIZATION OF A LINEAR-PREDICTIVE VOCODER
    A.  Introduction
    B.  LPCM System Description
        1.  Architecture
        2.  Instruction Format
        3.  Data-Memory Addressing
        4.  Timing Considerations
    C.  Engineering Considerations
    D.  Debugging and Test System
        1.  Hardware and Software Debugging Aids
        2.  The LPCM Simulator and Assembler
    E.  Firmware Considerations
        1.  The LPC Algorithm
        2.  Implementation of the LPC Algorithm
    F.  Conclusions
    Appendix A:  LPCM Mnemonics
    Appendix B:  LPCM Specifications

IX.  CHARGE-TRANSFER-DEVICE IMPLEMENTATION OF CHANNEL VOCODERS

X.  CONTINUING WORK AND CONCLUSIONS

Glossary


ACKNOWLEDGMENTS


This report for FY 76-7T to the Defense Communications Agency
was edited by J. Tierney and W. J. Finnegan. It is based on work
by B. Gold, E. M. Hofstetter, R. J. McAulay, S. Seneff, J. Tierney,
and O. C. Wheeler. The section on Charge Transfer Devices is
based on work by K. W. Broderson of the Electronics Research
Laboratory, University of California, Berkeley.


SPEECH EVALUATION


I.  PROGRAM OVERVIEW - FY 1976-7T

Lincoln Laboratory's commitment to the Defense Communications Agency during the period
FY 76-7T was outlined in the FY 75 final report and was incorporated into the DCA statement
of work as follows:

(a)  Tandeming and conferencing experiments using high-, medium-, and
low-rate digitizers.

(b)  Telephone channel simulation for controlled testing of phone-line
distortions and their effect on speech digitizers.

(c)  Investigations of speech digitizer talker sensitivity, and techniques
for reducing this effect.

(d)  Study of medium-rate coders with a view toward improved quality
and/or lower implementation costs.

(e)  Design and development of a low-cost LPC microprocessor terminal
for use at 4.8, 3.6, and 2.4 kbps. Such a terminal must be suitable for
large-scale defense communication systems deployment.

(f)  Investigation of effects and cures for carbon button microphone inputs
to speech digitizers.

(g)  Investigation of effects and cures for input environmental noise
vulnerability in speech digitizers.

All  these  tasks  are  directly  or  indirectly  related  to  DCA  and  DoD  Secure  Speech  Consortium 
interests. 

For reporting purposes, these tasks are grouped into three major technical categories:

A.  Robust  Speech  Processing 

B.  Interoperability  of  Wideband  and  Narrowband  Speech  Terminals 

C.  Vocoder  Hardware  Implementations. 

In this Overview, we discuss the rationale for the selection of these three topics as the major
research areas and summarize the conclusions derived from our findings. A fourth topic:

D.  Experiments  and  Demonstrations 

includes  certain  field  activities  at  DCEC  and  NRL  to  demonstrate  accomplishments  under  the 
contract  tasks. 

In Sec. II, we list more specifically the tasks accomplished; the remainder of the report
gives detailed information on these tasks, leaning heavily on reports prepared during FY 76-7T
under the Speech Evaluation contract.

A.  ROBUST  SPEECH  PROCESSING 

It is now generally accepted that a variety of narrowband speech digitizers work quite
satisfactorily under laboratory environments. In this effort, we were concerned with improving
performance under environments such as acoustically coupled background noise, telephone
system channel errors, input distortions of carbon button microphones, and problems
of atypical talker characteristics. The Consortium Test and Evaluation (T&E) program
results* had already convincingly demonstrated that intelligibility was often significantly reduced
due to these degradations. During the course of FY 76-7T, we were able to work on the
problems of acoustically coupled background noise and telephone input speech.


1.  Acoustically Coupled Background Noise

The major activity in this area was directed toward modeling the noise background as a
basis for applying the methods of statistical decision theory. Attention was specifically directed
toward the voiced-unvoiced decision problem since, in a noisy environment, this component of
speech analysis was judged most likely to degrade. Furthermore, experience has shown that
failure of this component causes a drastic reduction in listener acceptability of the vocoder
system. A statistical decision criterion has been successfully applied (in non-real-time) in
the presence of correlated Gaussian noise background, and a conceptual design toward a
real-time implementation of the new algorithm has been outlined.

2.  Telephone  Input  Speech 

It has long been recognized that speech distortions introduced by the telephone handset and
by a variety of telephone channel phenomena can adversely affect the performance of a narrowband
speech processor. One of the major difficulties in approaching this problem has been the
fact that telephone channels differ greatly. Our first step was thus to set up a real-time flexible
telephone channel simulation on a Digital Voice Terminal (DVT). This facility allows for
controlled variations of telephone distortions caused by frequency response, phase distortions,
frequency offsets, and nonlinear effects. Our next step was to investigate the pitch measurement
problem of a vocoder for telephone speech. This led to an innovative design of a pitch
detector based on a harmonic analysis of the speech wave; our work showed that this form of
preprocessing resulted in a pitch algorithm that was significantly less vulnerable to both
telephone phase distortion and background noise.

B.  INTEROPERABILITY OF WIDEBAND AND NARROWBAND SPEECH TERMINALS

The Autosevocom II network will consist of CVSD terminals interconnected by 16-kbps
channels. However, it is desirable that a significant number of users with lower available
bandwidth be able to use the network. For a narrowband terminal to speak to a wideband
terminal requires appropriate interfacing at the switch between two distinct speech algorithms.

One specific method of providing this interface is via tandem connections; this is the method
we investigated in FY 76-7T. Since LPC is presently considered to be the most suitable
narrowband speech digitizer, a group of experiments involving CVSD-LPC tandems was
implemented. The overall results indicated that such a tandem, in either direction, could be made
acceptable provided that LPC was run at rates of 3600 bps or higher, the CVSD parameters
were carefully adjusted, the CVSD channel error rate was less than 1 percent, and the voice
volume was kept reasonably steady. Thus, our overall conclusions were that LPC-CVSD
tandeming could be acceptable under a limited set of benign conditions, but that tandeming was
significantly less robust than either component of the tandem. Two specific "remedies" were
tried. One was to insert a chirp filter between the LPC and CVSD in a configuration where the
LPC comes first; modest quality improvements were noted. The other was to replace the CVSD
algorithm with a 16-kbps APC algorithm. In our judgment, the tandem of LPC and 16-kbps APC
resulted in significant improvement over the LPC-CVSD tandem.

*This undertaking was supported in part by the Defense Advanced Research Projects Agency
of the United States Government.

C.  VOCODER HARDWARE IMPLEMENTATIONS

The recent introduction into the marketplace of high-speed LSI chips made it feasible to
design a small microprocessor dedicated to the LPC algorithm. This task was accomplished
and is documented later in this report; at present, two working models exist and one has been
used for demonstration at DCEC. We believe that, at the time of completion of this project,
it represented the most compact and potentially least costly narrowband speech processor yet
built.

As a result of the favorable performance characteristics of the U.K. Belgard channel
vocoder noted in the Narrowband Speech Consortium T&E program, a Belgard vocoder was
implemented on the DVT. Comparison results indicated Belgard to be competitive with LPC
in terms of voice quality and robustness. Belgard implementation on digital machines
requires more than twice the signal-processing power of LPC implementation. However, new
technology such as charge transfer devices and efficient MOS integration makes special-purpose
hardware realizations of transversal filters and detectors attractive. In this light, a 3-month
effort was subcontracted to the University of California, Berkeley, Electronics Research
Laboratory to implement a full-wave rectifier and desampling filter for channel vocoder use. This
circuit has been delivered to Lincoln for evaluation. In addition, a study at both Lincoln and
Berkeley of a full vocoder channel analyzer, consisting of a bandpass filter, a rectifier, and a
low-pass filter, indicates an efficient realization to be feasible.
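
The channel analyzer named above (bandpass filter, rectifier, low-pass filter) can be illustrated in software. The following is only a minimal sketch of one analysis channel, not the Belgard design or the charge-transfer-device circuit: the second-order bandpass section, the one-pole smoothing filter, the 8-kHz sampling rate, and all function and parameter names are assumptions chosen for illustration.

```python
import math


def bandpass_biquad(center_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients (b, a) of a second-order bandpass section with unity
    gain at the center frequency (an assumed, generic filter choice)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    q = center_hz / bandwidth_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def channel_envelope(samples, center_hz, bandwidth_hz,
                     sample_rate_hz=8000.0, smooth_hz=25.0):
    """One vocoder analysis channel: bandpass -> full-wave rectifier -> low-pass.

    Returns the smoothed envelope, one value per input sample.
    """
    b, a = bandpass_biquad(center_hz, bandwidth_hz, sample_rate_hz)
    # One-pole low-pass coefficient for the envelope smoother (assumed form).
    g = 1.0 - math.exp(-2.0 * math.pi * smooth_hz / sample_rate_hz)
    x1 = x2 = y1 = y2 = 0.0
    env = 0.0
    out = []
    for x in samples:
        # Bandpass filter (direct form I difference equation).
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        # Full-wave rectify, then smooth with the low-pass filter.
        env += g * (abs(y) - env)
        out.append(env)
    return out
```

A full analyzer would run one such channel per frequency band and sample each envelope at the frame rate; a tone placed inside a channel's passband produces a much larger envelope in that channel than in a distant one.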

D.  EXPERIMENTS AND DEMONSTRATIONS

Several demonstrations and tests were set up and run at both DCEC-Reston and
NRL-Washington. Narrowband consortium tests continued until early September 1975 when the DCA
DVT equipments were moved to NRL for HF channel tests of the LPC vocoders. At the end of
September 1975, the DVTs were returned to Reston for vocoder demonstrations, and were
returned in February 1976 to Lincoln Laboratory. In June 1976, two DVTs equipped with
read-only memory (ROM) to make them freestanding vocoders at 2.4, 3.6, and 4.8 kbps (as opposed
to being loaded from a PDP 11/20 as in earlier tests) were delivered to DCEC for additional
demonstrations. These units were returned to Lincoln in midsummer. At the end of FY 7T a
tandem LPC-chirp filter-CVSD experiment consisting of a DVT with ROM implementing a chirp
filter, a CVSD encoder/decoder, an error generator, and the newly developed microprocessor
LPCM vocoder was delivered to Reston.




II.  OUTLINE OF THIS ANNUAL REPORT


The sections which follow describe each of the major areas of effort for FY 76-7T, plus
our conclusions and a statement on our continuing efforts. The area of "robustness" of narrowband
speech digitizers is discussed in Sec. III. Starting from the Narrowband Speech Consortium
T&E results, the remaining serious problems of vocoder sensitivity are outlined. In Sec. IV,
the results of a telephone-line simulator programmed on the Lincoln Laboratory
signal-processing machine are presented. The simulation runs in real time, allowing an experimenter
to add standard phone-line distortions to a speech signal under study before processing the
signal with a narrowband device. Section V shows an application of the telephone-line simulator.

In particular, this Section discusses development of a fundamental frequency/pitch extractor
for telephone-line distorted speech. A device of this type is a necessary part of present
narrowband speech terminals. An approach to the problem of LPC vocoding of noisy speech is
presented in Sec. VI. The noisy signal may have been produced by passage through a noisy channel
or by a talker in a bad acoustic environment. This work is complementary to the pitch detection
work of Sec. V, since both aim toward vocoding of noisy, distorted speech signals. Section VII
attempts to summarize the various investigations toward improved CVSD, improved tandeming
of CVSD, improved tandeming of LPC-CVSD, and an improved high-rate (16-kbps) APC system
for higher quality narrowband-wideband tandeming. This Section also presents some results on
dispersive, nonrecursive, digital, all-pass, chirp filters for speech conditioning between
narrowband and wideband terminals. Section VIII reports the results of the LPCM hardware
development; this program has produced two microprocessor-based LPC speech terminals. The
design and development process as well as the final specifications of the existing devices are
also discussed here. Section IX is concerned with our small outside effort in charge transfer
device-MOS implementation of a Belgard-like channel vocoder. A statement of future work
based on the accomplishments described here is presented in Sec. X.




III.  ROBUST NARROWBAND DIGITIZERS


A.  IDENTIFICATION OF ROBUSTNESS ISSUES

Partly through the Test and Evaluation (T&E) program of the Narrowband Speech
Consortium, and partly through work in our laboratory, we can identify the following
"robustness" issues.

1.  Variation Among Speakers

We observe, in demonstrating our various speech algorithms to visitors, that there is
a wide variation in what listeners find acceptable; this variation seems to depend on both the
speaker and the listener. For example, we hear comments like "I read you but it doesn't
sound like you," whereas another listener will be very happy with the same speaker's identity.
Many of the T&E tests on our LPC system show a definite drop in intelligibility in going from
male to female speakers. If a speaker speaks too softly (or too loudly), degradation usually
results. It is a commonly accepted folk-fact that people who have been in the narrowband
voice business for a while make better speakers. To summarize, it is quite well accepted that
narrowband systems such as LPC are less robust than a telephone line and may even be
unacceptable for a large number of speakers.

2.  Tandeming

Communications networks tend to grow partly in a random manner, and it is expected that
any large-scale network will include channels of differing bandwidths and will therefore require
different speech terminals. This fact, plus the requirement of conferencing, leads to the
tandeming problem wherein, for example, the voice must be processed through an LPC followed
by a delta-modulation system. T&E results have demonstrated that tandeming of two speech
processors degrades intelligibility and quality, and that three or more in tandem are generally
unacceptable.

3.  Background Noise

In many military environments, high-level background noise is unavoidable. Again, severe
degradation often leading to unacceptable results has been demonstrated from the T&E results.

4.  Telephone  Speech 

For  the  foreseeable  future,  we  expect  the  telephone  handset  to  be  the  most  popular  and 
cheapest  of  all  speech  terminals.  Flexibility  of  both  military  and  civil  telephony  would  be 
greatly  enhanced  if  existing  analog  facilities  could  be  integrated  with  new  digital  communications 
facilities.  To  accomplish  this  requires  that  speech  algorithms  remain  robust  after  the  analog 
voice  has  been  processed  through  a carbon  button  microphone  and  a telephone  line.  To  date 
there  is  insufficient  quantitative  data  on  the  effects  of  such  processing,  although  it  appears 
from  informal  experiments  that  degradation  can  be  very  severe. 

5.  Channel Errors

Channel  errors  which  perturb  the  speech  processor  digital  output  can  cause  varying  trouble 
depending  on  the  robustness  of  the  system  to  such  errors.  One  approach  to  such  problems  is  to 
focus  attention  on  the  modem  and  try  to  reduce  the  channel  error  rate  to  a very  low  number  by 




TABLE III-1

COMPARISON OF MALE-FEMALE DRT SCORES FOR LINCOLN LPC

Data Rate
  (bps)     Microphone   Noise     Male   Female   Difference*

  2400      Dynamic      Quiet     83.7    77.7       6.0
  2400      Carbon       Quiet     79.4    74.7       4.7
  2400      Dynamic      Office    82.6    76.0       6.6
  3600      Dynamic      Quiet     86.1    80.4       5.7
  3600      Carbon       Quiet     81.6    74.9       6.7
  4800      Dynamic      Quiet     88.2    82.7       5.5
  4800      Carbon       Quiet     85.5    82.4       3.1

*Average difference = 5.47 percent.

TABLE III-2

COMPARISON OF MALE-FEMALE DRT SCORES
FOR HIGH-RATE SYSTEMS
(Dynamic microphones, quiet background)

                  Rate
System Name      (kbps)    Male   Female   Difference*

CVSD               32      95.7    93.6       2.1
CVSD               16      91.0    88.0       2.2
APC (CODEX)        16      90.8    87.8       3.0
HY11               16      90.2    86.7       3.5
APC (Lincoln)      16      92.2    91.5       0.7
APC (Lincoln)       9.6    87.3    87.9      -0.6
APC (CODEX)         9.6    85.3    84.6       0.7
CVSD                9.6    82.3    75.7       6.6
HY11                9.6    79.7    75.7       4.0
APC                 8      89.3    87.6       1.7

*Average difference = 2.39 percent.


means of redundant coding techniques. However, economies may be gained by designing the
source coding to be robust in a speech sense. In particular, demanding error-free
transmission in a jamming environment may be more costly than designing a speech processor which
degrades gracefully as channel errors increase.

B.  THE SPEAKER VARIATION PROBLEM

At present, there does not seem to be enough T&E data to allow us to establish quantitative
reasons for the accepted fact that LPC systems are, in general, not robust across speakers.
However, even a cursory examination shows that the T&E female speakers scored consistently
lower than the males, so let us direct our attention to the possible reasons for this special
example of speaker variation.

Table III-1 shows some available T&E results for the Lincoln LPC algorithm as
implemented on the LDVT (Lincoln Digital Voice Terminal). This table shows that for both carbon
and dynamic microphones, for different data rates and even in one given noisy environment,
the Lincoln LPC discriminates quite consistently against women for DRT (diagnostic rhyme
test). Let us inquire into some possible reasons. The whole problem can be speculated out of
existence by assuming that the particular females chosen for T&E speakers simply did not
articulate as clearly as the males, and thus the Lincoln LPC was not to blame. A look at
Table III-2 shows that the females scored somewhat lower for all but one of the tested wideband
(9.6 to 32 kbps) systems. However, the average drop for Table III-2 was 2.39 percent, which is
little more than the standard error, whereas for Table III-1 the average drop was 5.47 percent.
The difference between Tables III-1 and III-2 is certainly statistically significant!
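
The two average drops quoted above can be checked with a few lines of arithmetic. The sketch below is an illustration only; the scores are transcribed from Table III-1 and from the printed Difference column of Table III-2, and the variable and function names are our own.

```python
# Male and female DRT scores transcribed from Table III-1 (Lincoln LPC).
TABLE_III_1 = [
    (83.7, 77.7), (79.4, 74.7), (82.6, 76.0), (86.1, 80.4),
    (81.6, 74.9), (88.2, 82.7), (85.5, 82.4),
]

# Printed "Difference" column of Table III-2 (high-rate systems).
TABLE_III_2_DIFFS = [2.1, 2.2, 3.0, 3.5, 0.7, -0.6, 0.7, 6.6, 4.0, 1.7]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Average male-minus-female drop for each table.
lpc_drop = mean(male - female for male, female in TABLE_III_1)
wideband_drop = mean(TABLE_III_2_DIFFS)

print(f"Table III-1 average drop: {lpc_drop:.2f}")
print(f"Table III-2 average drop: {wideband_drop:.2f}")
```

The results agree with the 5.47 and 2.39 percent figures given in the table footnotes.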

An obvious potentially significant parameter which might help explain this is the bandwidth.
For example, we know that, for the voiceless fricative sounds (s, sh, f, th), female spectra
generally extend to a higher bandwidth than male spectra. Increasing processing bandwidth
increases the cost of the processor in a very direct way by forcing an extension of its
computational power; thus, proposals for increased bandwidth should be viewed very cautiously. There
is some slight evidence that increased bandwidth helps females from the scanty Belgard results
shown in Table III-3. This 2400-bps channel vocoder utilizes substantially more bandwidth
than the Lincoln LPC, both in its voicing detector system as well as in the spectral processing.
For the two quiet cases shown, the scoring loss for females is significantly lower. The large


TABLE III-3

COMPARISON OF MALE-FEMALE DRT SCORES
FOR BELGARD CHANNEL VOCODER

Data Rate
  (bps)     Microphone   Noise     Male   Female   Difference*

  2400      Dynamic      Quiet     87.2    84.6       2.6
  2400      Carbon       Quiet     82.6    80.2       2.4
  2400      Dynamic      Office    84.6    77.6       7.0

*Average difference = 4 percent.



scoring loss for females for office noise could be attributed to a clobbering of those high
frequencies which we speculated are needed to raise female scores. Apparently contradictory
results are obtained from a perusal of Table III-2, where female scoring was not significantly
lower, even though most of these higher data-rate systems processed equal or lower bandwidth
spectra than the Lincoln LPC. The most plausible theory which is consistent with Tables III-1,
III-2, and III-3 is this: for a high data-rate system, where the speaker's articulation is
reproduced quite accurately, enough transitional acoustic cues remain in the processed speech
so that the average listener can score well even in the absence of the higher frequency
components. However, for a lower data-rate system these cues (such as formant transitions) are
sufficiently lacking so that a higher score can be brought in "by the back door" by simply
incorporating higher frequencies. Thus, we can hope that algorithmic improvements in LPC
will avoid the need to process more speech bandwidth.

If bandwidth is not the main issue, other pre-computer audio processing functions may yet
be of first-class importance. Two items come to mind immediately, namely, frequency
equalization (pre-emphasis) and volume control. Let us discuss these items.

1.  Volume Control

We know that speech sounds vary in volume by about 40 dB, from the powerful vowel a as
in "fat" to the weak fricative f. Both analog and digital speech processors can cope with this
dynamic range, but if we superpose variations in the average volume from very quiet speakers
to those who roar into the handset, this gives us another 20 to 50 dB to deal with. From our
experience, we have grave reservations about the use of commercially available volume-control
circuits for audio pre-processing. Such devices cause distortions and make the problem of
isolating and dealing with many distortions caused by the narrowband processor more difficult.
Also from our experience, we feel quite confident that monitoring of the input speech volume
by a competent human monitor could go a long way toward avoiding the overflow and underflow
problems in the digital speech processor.

Fig. III-1.  "Smart" volume control.

Figure III-1 indicates a setup which at least permits
research on this problem. In this figure, we are ignoring other audio conditioning such as a
pre-emphasis or pre-sampling filter. By controlling a commercially available digitally
controlled analog attenuator via a program in the processor, much more flexible (i.e., "smarter")
volume control can be incorporated into the processor. For example, the program would not
have difficulty implementing the following concepts:

(a)  Every time an input speech sample is peak clipped on its way into the
A-D converter, the attenuator is switched to 6 dB (1 bit) lower volume.




(b)  If several seconds of processing indicate that no speech was entered
into the system during that time, the attenuator can remain untouched
on the basis that no speech information was available to make
volume-control decisions.

(c)  If several seconds of speech show that the upper bits of the A-D
converter have not been tickled at all, the attenuator can be switched to
permit greater volume.
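
Rules (a) through (c) can be sketched as a small state machine. The following is a hypothetical illustration, not the program run on the processor: the 12-bit converter width, the review interval, the silence threshold, the starting attenuation, and the reading of "upper bits not tickled" as "top two magnitude bits unused" are all assumptions.

```python
class SmartVolumeControl:
    """Program-controlled attenuator logic following rules (a), (b), (c)."""

    def __init__(self, adc_bits=12, step_db=6, min_atten_db=0, max_atten_db=42,
                 review_samples=16000, speech_threshold=32):
        self.full_scale = 2 ** (adc_bits - 1) - 1   # largest A-D magnitude
        self.step_db = step_db                      # 6 dB corresponds to 1 bit
        self.min_atten_db = min_atten_db
        self.max_atten_db = max_atten_db
        self.review_samples = review_samples        # "several seconds" (assumed)
        self.speech_threshold = speech_threshold    # below this: no speech (assumed)
        self.atten_db = 18                          # arbitrary starting setting
        self._count = 0
        self._peak = 0

    def _reset_window(self):
        self._count = 0
        self._peak = 0

    def process(self, sample):
        """Feed one A-D sample; return the attenuator setting in dB."""
        magnitude = abs(sample)
        if magnitude >= self.full_scale:
            # Rule (a): a peak-clipped sample lowers the volume by one step.
            self.atten_db = min(self.atten_db + self.step_db, self.max_atten_db)
            self._reset_window()
            return self.atten_db
        self._peak = max(self._peak, magnitude)
        self._count += 1
        if self._count >= self.review_samples:
            if self._peak < self.speech_threshold:
                pass  # Rule (b): no speech seen, leave the attenuator untouched.
            elif self._peak < self.full_scale // 4:
                # Rule (c): upper bits never exercised, permit greater volume.
                self.atten_db = max(self.atten_db - self.step_db, self.min_atten_db)
            self._reset_window()
        return self.atten_db
```

Because the decisions are made by a program rather than by an analog circuit, the thresholds and time windows can be changed freely during experiments, which is the flexibility the text is after.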

2.  Spectrum  1 '.qualization 

This  has  been  the  subject  of  continuinp  controversv.  I’resentK,  there  is  not  et.en  complete 
acreenient  as  to  whether  pre-emphasis  helps  or  hinders  I.l’C  proc  essing,  although  the  predom- 
inant bias  favors  the  use  of  pre-cinphasis.  Atlaptive  pre-emi)hasis  poses  a (iroblcm  since,  at 
the  synthesizer,  compensating  adaptive  de-emphasis  must  be  im  orporated  based  on  th<-  trans- 
mitted pre-emphasis  parameter  (or  parameters).  Noti<  e that  such  extra  transmission  is  not 
needed  for  volume  control.  It  is  interesting  that  sevenil  experienced  narrowband  speech  work- 
ers favor  the  incorporation  of  fixed  pre-emphasis  and  no  de-emphasis.  This  fils  in  with  the 
"back-door"  theory  discussed  above  wherein  poor  prficessing  is  compensated  bv  letting  in  ex- 
cessive high  frequencies.  If  not  for  possible  adverse  effects  in  the  presence  of  hackgro\md 
noise  plus  utter  degradation  in  tandem  situations,  this  might  not  be  too  bad  an  answt  r. 

The biggest objection against adaptive pre-emphasis is the difficulty of suitably measuring
the desired adaptive parameter, and the subsequent possibility of doing more harm than good.
A simple step to try would be to incorporate two fixed pre-emphasis filters, one for males and
one for females. Then, analysis would consist of (a) making a male-female decision, (b) transmitting
a single bit, and (c) switching between one of two possible pre- and de-emphasis networks
at the transmitter and receiver.
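The two-filter scheme can be sketched as follows. This is an illustration only: it assumes first-order pre-emphasis of the form y[n] = x[n] - a*x[n-1] with the matching recursive de-emphasis, the two coefficient values are hypothetical, and the male-female decision itself is taken as given. Filter state is reset for each call, which a real terminal would not do.

```python
# Hypothetical coefficients indexed by the transmitted bit (0 = male, 1 = female).
COEFFS = {0: 0.9, 1: 0.8}

def pre_emphasize(x, a):
    """First-order pre-emphasis: y[n] = x[n] - a * x[n-1]."""
    y, prev = [], 0.0
    for s in x:
        y.append(s - a * prev)
        prev = s
    return y

def de_emphasize(y, a):
    """Inverse filter: x[n] = y[n] + a * x[n-1]."""
    x, prev = [], 0.0
    for s in y:
        prev = s + a * prev
        x.append(prev)
    return x

def transmit(frame, is_female):
    """Step (b): one bit selects the network; step (c) at the transmitter."""
    bit = 1 if is_female else 0
    return bit, pre_emphasize(frame, COEFFS[bit])

def receive(bit, frame):
    """Step (c) at the receiver: switch in the matching de-emphasis."""
    return de_emphasize(frame, COEFFS[bit])
```

Because the receiver is told which of the two fixed networks was used, the de-emphasis undoes the pre-emphasis exactly, at a cost of one transmitted bit.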


3.  Adaptive Quantization of the LPC Parameters

At present, the LPC parameters chosen for transmission are the pitch, voicing decision,
gain, and reflection coefficients. To date, there has been no convincing argument for or against
the use of these parameters except the practical one that no better set has been found. It is an
interesting fact that channel vocoder parameters consist of the sampled spectral magnitude or
some linear function of these samples. It also seems to be true, at the moment, that the best
2400-bps channel vocoder outperforms all 2400-bps LPC implementations in the intelligibility
tests of the T&E program. Over the years, important intuitions have been built up about the
way speech spectral magnitude changes with both frequency and time. In either dimension there
are fairly well-understood constraints on these changes, and thus clever adaptive coding schemes
can be and have been successfully developed. As yet, no correspondingly sophisticated coding
scheme has been demonstrated for reflection coefficients.

There is a sound reason why the apparently straightforward problem of quantizing LPC
(or channel vocoder) parameters seems to be the most difficult and time-consuming problem
in the design of narrowband speech terminals. Certainly, for 2400-bps systems, the greatest
injustice to the spectral integrity of the speech occurs at this very quantizer. Makhoul¹ has
shown that spectral sensitivity to reflection coefficient (k_i) changes is greatest when |k_i| ≈ 1;
unfortunately, histograms of k_i show clearly that |k_i| is rarely in the vicinity of unity. Also,
Seneff² has demonstrated that a coding scheme which utilizes a given speaker's k_i histograms
improves the LPC output. The two above techniques are actually contradictory, and no one
has yet offered a way of resolving this contradiction.

4.  Philosophy of Speaker Adaptation

Although no clear-cut solutions arise from the above discussion, a conceptual approach to
the problem does emerge. It is clear that significant variations in the voice to be processed
by a narrowband algorithm occur due to the innate differences among speakers, different
microphones, differing ways whereby speakers handle the handset, different pre-emphasis,
post-emphasis, pre-sampling, post-sampling filtering, different sampling rates, etc. It is
very unlikely that all these variations can ever be standardized. What is needed is an adaptation
mechanism. Philosophically, at least, this can be introduced by defining a third basic epoch
(in addition to a sampling epoch and a frame epoch) which we could call an adaptation epoch.
Intuitively, this epoch should be of about 1-sec duration. The system would be collecting statistical
data during each such epoch and adjusting system parameters (as opposed to voice
analysis parameters) from epoch to epoch. Obviously, this type of adaptivity is useful only
for conversational speech communication, and would perhaps be a hindrance for items such as
word list intelligibility testing (unless one insisted on extended testing with a single speaker).
From an implementation point of view, one would guess that a fairly extensive amount of additional
processing would be needed, but at a slow rate. The difficult part of the work would be
the invention of algorithms that would do more good than harm.

C.  TANDEMING

Since a single narrowband digital voice terminal degrades the clear input speech, it makes
sense that two or more such devices in tandem will cause more degradation. The T&E results
bear this out. For example, the Lincoln LPC DRT score goes from 83.7 through one system
to 72.6 for two, and to 66.5 for three in tandem. The PARM scores go from 55.4 to 45.0 to 13.7.

A surprising result emerged by comparing a tandem connection of two Lincoln LPCs with
an LPC-CVSD tandem. Since CVSD is a wideband (16-kbps) system, the latter tandem would
be expected to be an improvement over the tandeming of two narrowband systems. The actual
results were opposite, however, with LPC-CVSD yielding a DRT of 70.0 compared with 72.6
for LPC-LPC. Also, LPC-LPC-LPC gave a DRT score of 66.5 compared with 61.7 for the
worst combination of a single LPC with two CVSDs. The clear indication is that the CVSD
algorithm (at least with the presently used parameters) can't cope too well with the reproduction
of LPC speech.

A wide  variety  of  tandem  configurations  exists  which  could  be  of  practical  interest  in  a 
communications  network.  At  present,  we  are  not  equipped  with  a general  methodology  for  im- 
proving the  voice  quality  of  an  arbitrary  tandem  of  the  same  or  different  speech  algorithms. 
However,  we  can  address  several  specific  pertinent  topics. 

1.  Conferencing  of  Digitally  Processed  Speech 

Ordinary analog telephonic conferencing is attained through summing the individual speech
signals at a central node and distributing the sum to all the conferees. From the speech point
of view, conference calls lead to a degradation in signal-to-noise ratio (SNR). Most of the
time, only one conferee is speaking, yet the environmental noise associated with each conferee
is always present.




It is interesting to inquire as to whether selection which tended to favor the
"loudest" speaker would benefit the above analog conferencing. A device which selected the
"loudest" speaker and suppressed the signals on the other lines would certainly improve the
received SNRs, but could conceivably cause other problems from abrupt switching
among lines. Since we are not familiar with any comparative work that may have been done
between these two conferencing methods, we want to make only the following point: for analog
conferencing, summing algorithms are simpler to implement than selection algorithms and,
except for signal-to-noise loss and telephone-line degradation, no other processing loss in
speech quality takes place. The situation is quite different, however, for conferencing which
involves speech digitizers such as CVSD or LPC. First, use of the summing algorithm implies
that tandeming is necessary. Second, digital selection hardware appears to be cheaper than the
hardware needed to implement analog summing of digital speech. Figure III-2 shows an example
of the use of the summing algorithm for four digitally processed speech signals. In order
to sum the speech of the conferees, each input bit stream from an analyzer must first be synthesized;
then the four resulting signals must be added, and the sum analyzed and distributed
to all conferee synthesizers. Thus, a quantity of hardware is required at the conferencing node.
Also, each conferee's speech is processed through two algorithms.

In Fig. III-3 the synthesizers of Fig. III-2 are replaced by a selection device which chooses
one of the conferees to send on. The selection algorithm and the switch should, in general, be
appreciably simpler than the collection of synthesizers required for Fig. III-2. Also, each
speech signal travels through only a single processor so that tandeming degradation is removed.
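A selection device of this kind can be sketched as below. This is a hypothetical illustration, not the device of Fig. III-3: it assumes the central station has a per-frame loudness measure for each conferee (here, mean squared amplitude), and the hold-off counter is an added assumption intended to soften the abrupt switching mentioned earlier.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame, used as the loudness measure."""
    return sum(s * s for s in frame) / len(frame)

class LoudestSelector:
    """Route one conferee's bit stream; switch only after a hold-off expires."""

    def __init__(self, hangover_frames=10):
        self.hangover = hangover_frames   # hypothetical hold-off length
        self.current = None               # index of the conferee being routed
        self.hold = 0                     # frames left before a switch is allowed

    def select(self, energies):
        loudest = max(range(len(energies)), key=energies.__getitem__)
        if self.current is None or loudest == self.current:
            self.current = loudest
            self.hold = self.hangover     # current talker is loudest: re-arm
        elif self.hold > 0:
            self.hold -= 1                # challenger is louder, but keep holding
        else:
            self.current = loudest        # hold-off expired: switch lines
            self.hold = self.hangover
        return self.current
```

Only the selected bit stream is forwarded, so each conferee's speech passes through a single analyzer-synthesizer pair.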


Fig. III-2.  Summing algorithm for four digitally processed speech signals in a
conference (A = analyzer, S = synthesizer).

Fig. III-3.  Digital conferencing using an algorithm to route a single selected bit
stream (A = analyzer, S = synthesizer).

Fig. III-4.  Configuration for a "disadvantaged" LPC user.

Fig. III-5.  Alternate configuration for a disadvantaged user.

Fig. III-6.  Trick for real-time simulation of four-way conference using four DVTs
and the FDP (A = analyzer, S = synthesizer).


A conferencing strategy which involves an overhead cost of accompanying control information
but leads to simpler speech-processing hardware has been quite successfully demonstrated
in a recent experiment designed by J. Forgie.³ Each conferee has a control box with a red and
green button plus a red and green light. Pressing the green button signifies a desire to talk,
while pressing the red button signifies that the speaker is about to stop. The green light permits
the conferee to talk (and be heard), while the red light indicates that he should not talk
(or that if he does, no one will hear him). Management of the resulting control information
and channel allocation is handled by computer. This setup leads to "polite" conferencing, with
no opportunity for interruptions. Intuitively, we feel that the ability to interrupt becomes less
desirable as the network delay increases, as it would through a satellite link or the ARPANET.
Even delays caused by long-distance lines plus associated switches could be sufficient to make
Forgie's method attractive.
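The button-and-light protocol can be pictured as a small floor-control state machine. The sketch below is an assumption-laden illustration, not a description of Forgie's experiment: the class and method names are hypothetical, and the first-come-first-served waiting queue is an invented allocation policy, since the text says only that a computer manages the control information.

```python
from collections import deque

class FloorControl:
    """One talker at a time; requests queue up (assumed policy)."""

    def __init__(self, conferees):
        self.conferees = list(conferees)
        self.talker = None        # conferee whose green light is on
        self.waiting = deque()    # pending green-button requests

    def press_green(self, who):
        """Signal a desire to talk."""
        if self.talker is None:
            self.talker = who
        elif who != self.talker and who not in self.waiting:
            self.waiting.append(who)

    def press_red(self, who):
        """Signal that the speaker is about to stop (or withdraw a request)."""
        if who == self.talker:
            self.talker = self.waiting.popleft() if self.waiting else None
        elif who in self.waiting:
            self.waiting.remove(who)

    def lights(self):
        """Green for the conferee who may talk, red for everyone else."""
        return {c: ("green" if c == self.talker else "red")
                for c in self.conferees}
```

Since only one light is green at a time, there is no way to interrupt, which matches the "polite" behavior described above.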

2.  The Disadvantaged User

In many long-distance situations, both users in a point-to-point voice communication do
not have the same bandwidth. Consider a user who must cope with a 2400-bps channel and thus
has an LPC terminal. On the other end is a "wealthy" user with a 16-kbit channel at his disposal
and a CVSD terminal. The straightforward configuration for this situation is shown in
Fig. III-4. It is clear that for each CVSD-LPC conversation, a complete duplication of the terminal
equipment is needed at the central node. Furthermore, this configuration results in
both CVSD-LPC and LPC-CVSD tandeming, with the latter resulting in poor scores from the
T&E results.

A partial alleviation of this situation, at some extra cost, is indicated in Fig. III-5, namely,
to provide the CVSD user with an LPC synthesizer. Then the CVSD-LPC tandem remains,
but the worst offender, LPC-CVSD, is now gone. As yet, we have no information on the
potential production cost of a synthesizer, but we know that two of the costlier items (the
correlator and pitch detector) are not present.

3.  Remarks on Real-Time Simulation of Tandeming Situations

Conferencing protocols and strategies should evolve based on experiments; in this section,
we confine our remarks to the capabilities at Lincoln for simulating the situations shown in
Figs. III-1 through III-3. A configuration involving four DVTs and the FDP has the performance
capability needed to simulate all but Fig. III-2. However, we can manage even this case by
means of the trick shown in Fig. III-6. Comparing Figs. III-2 and III-6 we see that the effects
of tandeming LPCs plus the effect of several simultaneous inputs into an LPC analyzer are
correctly simulated. Thus, four DVTs plus the FDP can handle all situations we have described
up to and including a 4-way LPC conference.

D.  BACKGROUND NOISE

We should first realize that different kinds of background noise create different problems,
and must be dealt with accordingly. The background noise used in the T&E program included
"office" noise (background speech, typewriters, etc.), airborne command post (ABCP) noise,
ship noise, and helicopter noise. Before discussing the T&E results, it is worth making a few
general remarks.


"Office" noise is not a major problem, in our opinion, since these noise sources are
usually of no greater volume than that of the talker and, being much further from the microphone,
do not greatly alter the input SNR. For example, the Lincoln LPC suffered little or no
loss in intelligibility and quality in going from a quiet to an office environment.

Helicopter noise causes disastrous intelligibility and quality losses in the tested speech.
This noise interferes greatly with the excitation analysis, but also may cause large problems
in extracting the vocal-tract LPC parameters. In our opinion, this problem should be treated
in a special way dictated by the environment, and any fixes should not be expected to work for
other noisy cases.

ABCP and ship noises appear to be more straightforward types of noise which can be
modeled as colored Gaussian noise. The most crucial aspect, we think, is inclusion of the
special characteristic of the noise-canceling microphones. We are quite convinced from our
own experiences that failure to match the pre-emphasis filter and/or the quantization levels of
the LPC parameters to the microphone causes large differences in the result. It is known that
noise-canceling microphones have response characteristics that are very different from either
carbon or close-talking dynamic microphones.

1.  Summary of T&E Results

Table III-4 summarizes the presently available T&E results of the background noise testing
of the Lincoln LPC system. For the 2400- and 3600-bps cases, both ship and ABCP cases
drop below the acceptable DRT score. For 4800 bps, intelligibility holds up well enough to be
acceptable, which lends credence to our argument that quantizer levels for LPC parameters
should be adjusted according to the microphone used.


TABLE III-4

DRT SCORES FOR LINCOLN LPC FOR VARIOUS TYPES
OF NOISE BACKGROUND

Environment   Microphone        2400 bps   3600 bps   4800 bps

Quiet         Dynamic             83.7       86.1       88.2
Quiet         Carbon              79.4       81.6       85.5
Office        Dynamic             82.6       86.0       85.8
ABCP          Noise canceling     72.6       73.7       79.3
Ship          Noise canceling     70.1       73.1       81.3
Helicopter    Noise canceling     48.0       45.0       56.0

The table shows clearly that helicopter noise is in a class by itself, and causes such severe
speech-processing problems that it ought to be treated as a separate problem.

For all the noise backgrounds that caused significant degradation, noise-canceling microphones
were used. Unfortunately, we have no results using these same microphones in a quiet
environment so that it is not clear to what extent the lower scores were caused by the noise or
by the microphone.


2.  Comments


It is difficult to propose solutions to problems involving signals in noise without having a
good deal of knowledge about the nature of the noise. Using a Gaussian colored noise model
which is additive to the signal, McAulay⁴ proposed how the maximum-likelihood method implicit
in quiet LPC can be extended to that of finding LPC parameters in a noise background.

Noise makes it very difficult to design a good voiced-unvoiced decision algorithm. Aside
from the poor resultant speech quality, background noise makes "silence" detection very difficult;
by silence we mean the absence of speech. Accurate detection of silence makes possible
TASI-like systems with consequent savings of as much as [illegible] in communication bandwidth.
Thus, a fruitful research area is that of applying maximum-likelihood methods of detection of
speech vs noise.
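To make the idea of a likelihood test for speech vs noise concrete, here is a deliberately simplified sketch. It is not McAulay's method and does not use a colored-noise model: it assumes each frame is zero-mean white Gaussian, with variance `noise_var` when only noise is present and `noise_var + speech_var` when speech is added. Both variances, the function name, and the threshold are hypothetical inputs.

```python
import math

def is_speech(frame, noise_var, speech_var, threshold=0.0):
    """Log-likelihood-ratio test: speech-plus-noise vs noise alone.

    Assumes independent zero-mean Gaussian samples under both hypotheses.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    total_var = noise_var + speech_var
    # log p(frame | speech + noise) - log p(frame | noise only)
    llr = (0.5 * energy * (1.0 / noise_var - 1.0 / total_var)
           - 0.5 * n * math.log(total_var / noise_var))
    return llr > threshold
```

Under these assumptions the test reduces to comparing frame energy with a threshold set by the two variances; a colored-noise model would replace the energy term with a whitened one.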

Summarizing, it seems that background noise in LPC can be combated by a variety of
methods, including proper choice of microphone, audio pre-processing, adaptive quantization,
and maximum-likelihood techniques. Accumulation and categorization of a noise data base is
an important aspect of such work.

E.  CHANNEL NOISE

There are two important aspects to the investigation of channel noise effects on speech
terminals. For one set of situations, sophisticated modem technology allows effectively errorless
transmission until the noise exceeds a fairly sharp threshold. In this case, the speech
terminal will either work or quit working, and the issue of the sensitivity of the particular
algorithm to channel noise loses importance. In another set of situations, it may be impractical
to include sophisticated modems, and then vulnerability to channel errors can be an important
feature of a speech processor.

1.  T&E Results

The T&E results give some data on the lowering of intelligibility and quality scores for
channel error rates of 1 and 5 percent. A general statement can be made that a 1-percent error
rate causes little perturbation of the LPC scores, whereas 5 percent causes a significant and
most likely unacceptable degradation in both intelligibility and quality. Somewhat mysteriously,
the performance of the 2400-bps Belgard channel vocoder suffered little loss of intelligibility at
5-percent error rate; at the moment, we have no good explanation and would like to further
investigate possible differences in vulnerability between LPC and channel vocoders. As expected,
CVSD systems also suffered little degradation at these error rates.

2.  Jamming 

From a practical point of view, the most interesting problem involving channel errors is
the problem of combating a jammer in a line-of-sight or satellite link. In our opinion, this
situation should be explored in the context of both an adaptive modem and an adaptive speech
processor. A mechanism for maintaining voice communications over a jammed link as long as
possible is shown in Fig. III-7. In addition to the spread spectrum modem and the speech processor,
a technique is needed for probing the channel and estimating the instantaneous channel
capacity; a set of algorithms is presently being developed by Goodman et al.⁵ to do just this.

The set of speech algorithms already developed on the DVT encompasses data rates from 2.4
to [illegible] kbps; thus, a variable data-rate source-coding system is presently a reality. It remains
to determine appropriate variable rate spread spectrum techniques and combine all elements
into a real-time simulation system.

Fig. III-7.  Method of probing a channel to estimate channel capacity and dynamically
adapting the speech processor to the available rate. (Block labels: speech, adaptive
processor, adaptive modem, data rate control, channel-capacity measurements, two-way
time-variable channel, probing signals.)
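The data-rate control block of such a system can be pictured as a simple rule: run the speech processor at the highest available rate that the current capacity estimate will support. The sketch below is hypothetical; the function name and the safety margin are assumptions, and the rate list in the usage note merely reuses the three LPC rates of Table III-4 for illustration.

```python
def choose_rate(capacity_bps, rates_bps, margin=0.0):
    """Pick the highest speech data rate that fits the estimated capacity.

    `margin` (an assumed parameter) reserves a fraction of the capacity.
    Returns None when even the lowest rate does not fit.
    """
    usable = capacity_bps * (1.0 - margin)
    feasible = [r for r in rates_bps if r <= usable]
    return max(feasible) if feasible else None
```

For example, with rates of 2400, 3600, and 4800 bps, an estimated capacity of 4000 bps selects 3600 bps, and a capacity below 2400 bps returns `None`, meaning the link cannot carry speech at any available rate.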

3.  Selective Coding of Parameters

From a perception viewpoint, we expect that some systems parameters cause worse degradation
if in error than other parameters; for example, a pitch error in LPC may be more
harmful than a comparable error in an LPC parameter. Flanagan⁶ has performed just perceptible
difference measurements on formant errors, and perhaps a similar study would be in
order on the effects of LPC parameter errors.

4.  Summary 

Rather than try to make speech coders less vulnerable to channel noise, it seems more
useful to reduce channel errors by redundancy coding techniques. Both approaches presumably
require extra hardware, but coding techniques are a fully developed field so that no new algorithms
need be developed. Experiments on selective coding of parameters are still worth performing;
a good start would be comparison of pitch vs spectral sensitivity in either vocoders
or LPC systems. Finally, the proven fact that the DVT is a flexible and easily adaptive speech
processor plus the ongoing variable channel capacity work make it very tempting to develop a
simulation facility for a speech terminal which adapts well to jamming.

F.  TELEPHONE SPEECH

It has been known for many years that speech that has been transmitted through a telephone
channel prior to vocoding causes very significant degradation of the vocoded speech. Admittedly,
this is a vague statement, since telephone channels have great variability. Trying to find a
common formulation for the properties of telephone-wire lines is a formidable undertaking;
we have been lucky enough to have received detailed information on telephone-line modeling
through the offices of Ronald Sonderegger of DCEC and Captain Lemon of the Department of
Defense. Armed with this information, Seneff⁷ has succeeded in implementing a real-time
simulation on the DVT. This allows us, by concatenating two DVTs, to do real-time listening
of LPC with telephone input. Later portions of this report describe this experimental configuration
in more detail.




REFERENCES

1.  J. Makhoul and R. Viswanathan, "Quantization Properties
of Transmission Parameters in Linear Predictive Systems,"
Report No. 2800, Bolt, Beranek and Newman (April 1974).

2.  S. Seneff, private communication.

3.  J. W. Forgie, private communication.

4.  R. McAulay, private communication.

5.  Semiannual Technical Summary, Information Processing
Techniques Program, Lincoln Laboratory, M.I.T. (31 December
1975), DDC AD-[illegible].

6.  J. L. Flanagan, Speech Analysis, Synthesis and Perception
(Springer-Verlag, Berlin, 1972).

7.  S. Seneff, "A Real-Time Digital Telephone Simulation on the
Lincoln Digital Voice Terminal," Technical Note 1975-65,
Lincoln Laboratory, M.I.T. (30 December 1975),
DDC AD-A021409/H.




IV.  THE TELEPHONE-LINE SIMULATOR

A.  OVERVIEW-DESCRIPTION OF A TELEPHONE CHANNEL

In a typical telephone channel,* the signal is first filtered, then modulated up to some carrier
frequency, transmitted with multiplexing through a series of cables and repeaters, and finally
filtered and demodulated at the receiver. Various distortions are introduced into the signal at
all parts of the system. The filters pass only energy between 300 and 3000 Hz, and introduce
both amplitude and phase distortion in the pass band. The modulation-demodulation process
introduces quadrature distortion whenever there is an offset between the modulating and demodulating
frequencies. Crosstalk is a result of both frequency and space division multiplexing,
whenever the filtering and/or shielding are inadequate. In traveling down the cable, the signal
becomes attenuated; therefore, the repeaters are necessary to amplify it at various points in
the transmission lines. The repeaters are a source of both Gaussian noise and nonlinear distortion.
When a signal is switched from one carrier to another, a sudden change in the phase relationship
is introduced as well as transient (impulse) noise. Another source of impulse noise is
lightning and corona-type discharges.

B.  SINGLE-SIDEBAND  (SSB)  MODULATION 

Typically, voice signals are frequency-multiplexed over a telephone line using SSB
suppressed-carrier (SSBSC) amplitude modulation,²,³ as shown in Fig. IV-1(a-d). In order to
set the stage for our subsequent discussion of phase distortion, SSBSC is reviewed in this
section.


Fig. IV-1.  SSB modulation-demodulation technique. (a) Modulated up; (b) bandpass-filtered
to exclude frequencies below $\omega_c$; (c) modulated down at receiver; (d) low-pass-filtered
to restore $F(\omega)$.


The signal is first bandpass-filtered from 300 to 3000 Hz and then modulated up by a carrier
cosine whose frequency $\omega_c$ is a multiple of 4000 Hz. The resulting signal is then bandpass-filtered
to eliminate any energy below $\omega_c$. Since the original signal was real, this entails no
loss of information. The signal is then frequency-multiplexed with several other signals at different
carrier frequencies (also multiples of 4 kHz) and is detected at the receiver by means of
a bandpass filter from $\pm\omega_c$ to $\pm(\omega_c + 4000\ \mathrm{Hz})$, followed by demodulation using a cosine also at
frequency $\omega_c$. However, no attempt is made to assure that the modulating and demodulating
cosine waves are in phase with each other. In fact, the modulating and demodulating frequencies
are rarely exactly the same, and any slight difference in frequency will cause a gradual drift
into and out of phase.

When the modulating and demodulating cosines are exactly 90° out of phase, the resulting
received signal is the quadrature component, or Hilbert transform,⁴ of the original signal. The
quadrature component can be understood from several different viewpoints. If one represents
the original signal $f(t)$ in terms of its Fourier expansion, i.e., as a sum of cosines at different
frequencies with differing amplitudes and phases, then the Hilbert transform $\hat{f}(t)$ could be constructed
by converting all the cosines to sines and adding them up. An impulse, for example,
is a sum of cosines all in phase at $\theta = 0$, at that point in time when the impulse occurred. If all
these cosines were converted to sines, every function would be zero at the time of the impulse,
all would be negative immediately before it, and all would be positive immediately after. Some
distance away on either side, the sines would be at random phase with respect to one another,
and hence the net sum at any point would be 0. The result is the function shown in Fig. IV-2.

One way to acquire the Hilbert transform is to convolve the signal with the function of
Fig. IV-2. Another way is to take the Fourier transform, multiply positive frequency by $-j$ and
negative frequency by $+j$, and take the inverse Fourier transform. It is clear that such a filter
would convert a cosine at any frequency $\omega$ into a sine, since the transform of a cosine is two
positive real impulses at $\pm\omega$ and the transform of a sine is a negative imaginary impulse at $+\omega$
and a positive imaginary impulse at $-\omega$: $(e^{j\omega} - e^{-j\omega})/2j = \sin\omega$.
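The frequency-domain recipe can be tried numerically. The sketch below is an illustration under stated assumptions: it works on one period of a sampled, periodic signal, uses a deliberately naive O(N²) discrete Fourier transform so that it needs only the standard library, and zeroes the DC and Nyquist bins (a common convention, assumed here). The function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(N^2)); adequate for short blocks."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def hilbert(x):
    """Hilbert transform: positive frequencies times -j, negative times +j."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            spectrum[k] = 0            # DC and Nyquist carry no sign
        elif k < n / 2:
            spectrum[k] *= -1j         # positive-frequency bins
        else:
            spectrum[k] *= 1j          # negative-frequency bins
    return [v.real for v in dft(spectrum, inverse=True)]
```

Applied to a sampled cosine with a whole number of cycles in the block, `hilbert` returns the corresponding sine, as the argument above predicts.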

Supposing one had available a complex signal $z(t)$ whose spectrum was zero for negative
frequency and identical to the spectrum of a given real signal $f(t)$ for positive frequency. According
to the previous discussion, such a function could be generated as follows:

$$z(t) = \tfrac{1}{2}\,[f(t) + j\hat{f}(t)] \qquad \text{(Fig. IV-3)}.$$


If one were to multiply $z(t)$ by $e^{j\omega_c t}$, the transform of the resulting function would be simply
$Z(\omega)$ translated up by a frequency $\omega_c$, i.e., $Z(\omega - \omega_c)$. Now, were one to take the real part of
this signal and determine its transform, the result would be the even part of $Z(\omega - \omega_c)$, which
(Fig. IV-4) is clearly the same as what would be obtained by multiplying $f(t)$ by $\cos\omega_c t$ and filtering
out energy below $\omega_c$. Remembering that the even part of a transform is the transform of
the real part of the function, we have:

$$
\begin{aligned}
f_s(t) &= 2\,\mathrm{Re}\,[z(t)\,e^{j\omega_c t}] \\
       &= \mathrm{Re}\,\{[f(t) + j\hat{f}(t)]\,e^{j\omega_c t}\} \\
       &= f(t)\cos\omega_c t - \hat{f}(t)\sin\omega_c t
\end{aligned}
$$

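Now it should be straightforward to explain the effect of a phase difference $\varphi$ between the modulating
and demodulating carriers:

$$
\begin{aligned}
f_r(t) &= [f(t)\cos\omega_c t - \hat{f}(t)\sin\omega_c t]\,\cos(\omega_c t + \varphi) \quad \text{(Low Pass)} \\
       &= f(t)\cos\omega_c t\,\cos(\omega_c t + \varphi) - \hat{f}(t)\sin\omega_c t\,\cos(\omega_c t + \varphi) \quad \text{(Low Pass)} \\
       &= f(t)\cos\omega_c t\cos\omega_c t\cos\varphi - f(t)\cos\omega_c t\sin\omega_c t\sin\varphi \\
       &\quad - \hat{f}(t)\sin\omega_c t\cos\omega_c t\cos\varphi + \hat{f}(t)\sin\omega_c t\sin\omega_c t\sin\varphi \quad \text{(Low Pass)}
\end{aligned}
$$

Recalling that

$$
\begin{aligned}
\cos^2\theta &= \tfrac{1}{2}(1 + \cos 2\theta) \\
\sin^2\theta &= \tfrac{1}{2}(1 - \cos 2\theta) \\
\sin\theta\cos\theta &= \tfrac{1}{2}\sin 2\theta
\end{aligned}
$$

and that the $\sin 2\theta$, $\cos 2\theta$ terms will all be eliminated by low-pass filtering at the output, we
have finally

$$f_r(t) = \tfrac{1}{2}\,[f(t)\cos\varphi + \hat{f}(t)\sin\varphi]$$

When $\varphi = 90°$, $f_r(t) = \tfrac{1}{2}\hat{f}(t)$, which is to say that only the quadrature term is received.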
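The result can be checked numerically for a single test tone. The sketch below is an illustration under stated assumptions: the message is one cosine (so its Hilbert transform is the matching sine), the tone and carrier occupy integer frequency bins of an assumed block length, and the low-pass filter is replaced by projecting the demodulated signal onto the message tone, which discards the double-carrier-frequency terms. All names and default values are hypothetical.

```python
import math

def demodulated_components(phi, n=256, k_msg=5, k_car=40):
    """Return (a, b) such that the low-passed output is a*f(t) + b*fhat(t)."""
    a = b = 0.0
    for i in range(n):
        wm = 2 * math.pi * k_msg * i / n       # message phase, omega_m * t
        wc = 2 * math.pi * k_car * i / n       # carrier phase, omega_c * t
        f, fhat = math.cos(wm), math.sin(wm)   # test tone and its Hilbert transform
        sent = f * math.cos(wc) - fhat * math.sin(wc)   # single-sideband signal
        recv = sent * math.cos(wc + phi)                # demodulate with phase error
        # Projection onto the message tone stands in for the low-pass filter.
        a += recv * f * 2 / n
        b += recv * fhat * 2 / n
    return a, b
```

For any phase error the returned pair should equal (cos(phi)/2, sin(phi)/2); at 90 degrees only the quadrature coefficient survives.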

C.  DISTURBANCES IN THE TRANSMISSION SYSTEM

1.  Quadrature Distortion

Armed with a knowledge of the mathematics of SSB carrier systems, it is now simple to
understand the various phase distortions that occur in the telephone systems. If there is a frequency
difference $\Delta\omega$ between the modulating and demodulating waves, we would have:

$$f_r(t) = [f(t)\cos\omega_c t - \hat{f}(t)\sin\omega_c t]\,\cos[(\omega_c + \Delta\omega)\,t] \quad \text{(Low Pass)}$$

The contribution to the phase difference $\varphi$ by the frequency offset $\Delta\omega$ is a function of time:

$$\varphi_\Delta = (\Delta\omega)\,t$$

which accounts for a slow drift into and out of phase.

Phase jitter is the term used to describe the interference of a sinusoidal noise (frequently
60 cycles) in the phase of a carrier. Sixty-cycle interference shows up everywhere in the system
but, in general, is harmless because of the 300-Hz lower limit of the filters. However,
60-cycle interference in the generation of a cosine causes a phase distortion of the cosine
waveform which appears as a very-low-frequency modulation (Fig. IV-5). The phase jitter introduces
an additive component to the phase which can be expressed as

    $$\phi_j = A_j \cos\omega_j t$$

where $A_j$ is the amplitude of the jitter in degrees, and $\omega_j$ is usually 60 cycles (50 cycles in
Europe).

Fig. IV-5. Cosine wave at 312 cycles/sec distorted by phase jitter at 60 cycles/sec,
30° peak-to-peak.

The final phase effect is phase hits, or sudden large shifts in phase, which occur as a result
of switching two carrier supplies not in phase. Sometimes a phase-coherent detector corrects
the situation some time later. On other occasions, the original phase relationship is
never restored.

The net phase relationship is a combination of the three effects:

    $$\phi = \phi_{\Delta\omega} + \phi_j + \phi_h$$

and the final output signal is

    $$f_r(t) = f(t)\cos\phi + \hat{f}(t)\sin\phi$$
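As an illustration of how the three contributions combine, here is a small sketch. It is not taken from the report: the function names are hypothetical, the phase hit is modeled simply as a constant added while a hit is in progress, and the jitter argument is an amplitude in degrees, as in the definition of $A_j$ above.

```python
import math

def net_phase_deg(t, offset_hz, jitter_amp_deg, jitter_hz, hit_deg=0.0):
    """Sum of frequency-offset drift, sinusoidal jitter, and any phase hit, in degrees."""
    drift = 360.0 * offset_hz * t                                   # (delta omega) * t
    jitter = jitter_amp_deg * math.cos(2.0 * math.pi * jitter_hz * t)
    return drift + jitter + hit_deg

def received(f, f_hat, phase_deg):
    """Output sample for in-phase value f and quadrature value f_hat."""
    phi = math.radians(phase_deg)
    return f * math.cos(phi) + f_hat * math.sin(phi)
```

With a 1-Hz offset and no jitter, the phase reaches 90° after a quarter of a second, at which point `received` returns the quadrature value alone.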

2.  Filtering 

A major part of the distortion in a telephone signal has to do with the frequency-response
and phase-delay characteristics of the filter used to assure that the signal remains in band.
Theoretically, in multiplexing at 4000-Hz intervals, one could pass all frequencies from 0 to
4000 Hz and avoid crosstalk. In practice, filters never cut off sharply, and so a healthy
compromise is to pass the signal from about 300 to 3000 Hz. In addition, there tends to be
considerable noise interference at low frequencies (for example, 60-cycle hum) so that it is
advantageous from a SNR argument to remove low frequencies. The characteristics of the filter in the
pass band represent a trade-off between cost and response. In particular, not much attention
was paid to phase-delay distortion, as the ear is insensitive to phase.

3.  Nonlinear  Distortion 

In the process of traveling down a coaxial cable, a signal's energy is gradually attenuated
due to losses in the transmission line, and it is therefore necessary to amplify the signal at
certain points in the transmission path to assure an adequate level at the receiver. The devices
which achieve the amplification are called repeaters, and are generally made up of two-port
amplifiers. The voltage-transfer characteristics of a generalized two-port are shown in Fig. IV-6.


Fig. IV-6. Nonlinear voltage transfer characteristics of two-port.




According to a power series expansion, the transfer characteristics can be expressed as:

    $$e_o = a_1 e_i + a_2 e_i^2 + a_3 e_i^3 + \ldots$$

where $a_1$ is the gain, and the other a's are the nonlinear distortion coefficients.

The sources of the nonlinear distortions are several, and much has been written on the
subject of repeater design (see, for example, Ref. 1, pp. 39f-421). The nonlinearities in a repeater
are, in general, frequency-dependent, and the calculation of the precise response for a given
system design is quite complicated.

4. Gaussian and Impulse Noise

Gaussian noise is an unavoidable side product of networks and devices. Two common types
of noise in a circuit are thermal and shot noise (see Ref. 1, pp. 1S1-1C4). Arguing from the
central limit theorem, both can be assumed to be Gaussian and white at the source. Thermal noise
is present in any conductor (for example, a resistor) and is due to the thermal interaction
between free electrons and vibrating ions. Its available power is directly proportional to the
product of the bandwidth of the system and the temperature of the source. Shot noise is due to the
discrete nature of electron flow and is present in most active devices (transistors and diodes).
Its amplitude is proportional to the square root of the current, and therefore is dependent on
signal level.

The components of the repeaters are a major source of the Gaussian noise. By the time
the noise reaches the receiver, it is no longer white because it has been shaped by the filters
at demodulation.

In  addition  to  Gaussian  noise,  there  is  occasionally  a burst  of  noise  on  a line  whose  ampli- 
tude far  exceeds  the  average  noise  level  of  the  system.  Such  noise  is  referred  to  as  "impulse 
hits"  and  is  generally  caused  by  switching  transients  in  central  offices  or  from  corona-type  dis- 
charges (electrical  discharges  in  the  air  surrounding  a high  potential  line)  that  occur  along  a 
repeated  line. 

5. Echo and Crosstalk

To  be  complete  in  a report  on  telephone-channel  characteristics,  one  must  include  at  least 
a brief discussion of echo and crosstalk, even though these two effects were not included in the
digital  simulation.  An  echo  may  be  produced  whenever  there  is  an  impedance  discontinuity.  It 
is  most  commonly  caused  by  a return  of  a talker's  signal  through  the  channel  to  which  he  is  lis- 
tening. For  short  distances,  the  effect  is  indistinguishable  from  side  tone  and  is  therefore  not 
annoying.  Echo-suppressor  circuitry  has  been  added  to  long-distance  lines  to  attenuate  the  sig- 
nal on  the  return  path  when  the  received  signal  amplitude  is  high.  The  result  is  that  when  both 
people  talk  only  one  is  heard,  and  sometimes  the  beginnings  of  words  are  clipped.  There  is, 
however, a "tone-disabler" section of the echo-suppressor circuit which permits the mechanism
to  be  shut  off  when  data  are  being  transmitted. 

Crosstalk is caused by coupling losses between two active circuits. Transmittance crosstalk
occurs because of inadequate design of modulators and filters in frequency multiplexing.
Coupling crosstalk is caused by electromagnetic coupling between two physically isolated
circuits. The result, of course, is that the listener hears another conversation in the background,
which may or may not be intelligible.




D. MEASURING THE TELEPHONE CHANNEL


Before the telephone channel could be simulated, it was necessary to make extensive
measurements on thousands of lines in order to determine concrete values to be used in the various
components of the simulation. From the study data set, histograms were compiled for all the
various parameters, such as frequency offset and level of Gaussian noise, for Continental U.S.
(Conus) and European voice- and data-grade lines. Table IV-1 is a chart of the resulting
numbers for the various degradations (excepting filter frequency response). The term "mid" is
used to refer to the 50th percentile on the histogram, and "poor" refers to the 90th percentile;
i.e., 90 percent of the Conus voice-grade channels measured were better than "Conus poor
voice."


TABLE IV-1

IMPAIRMENTS OF THE EIGHT TELEPHONE CHANNELS SIMULATED

Channel               Phase Hits           Frequency     Phase Jitter       Harmonic     Gaussian      Impulse Noise
Simulated                                  Offset (Hz)                      Distortion   Noise
                                                                            (dBmc)       (dBmc)
--------------------  -------------------  ------------  -----------------  -----------  ------------  ------------------------
Conus poor voice      45/15 min. at 17°    1             15° p-p, 60 Hz     -20          -40           None
Conus mid voice       5/15 min. at 17°     0             11.6° p-p, 60 Hz   -25          -46           None
Conus poor data       45/15 min. at 17°    1             14° p-p, 60 Hz     -20          -37           None
Conus mid data        5/15 min. at 17°     0             11° p-p, 60 Hz     -28          -45           None
European poor voice   405/15 min. at 42°   7             35° p-p, 50 Hz     -37          -39           225/15 min., 74 dBrnc
European mid voice    45/15 min. at 32°    3.5           26° p-p, 50 Hz     -43          (illegible)   25/15 min., 74 dBrnc
European poor data    35/15 min. at 42°    6             35° p-p, 50 Hz     -37          -39           225/15 min., 74 dBrnc
European mid data     35/15 min. at 22°    2.7           18° p-p, 50 Hz     -43          -45           25/15 min., 74 dBrnc

Gaussian noise and nonlinear distortion were measured in units of dBmc, meaning decibels
relative to 1 milliwatt, using a special C-message weighting curve (see Ref. 1, pp. 3t-}4) shown
in Fig. IV-7. This frequency-weighting curve was determined empirically by means of several
subjective listener tests, and is intended to reflect the amount of listener annoyance for noises
at different frequencies. The unit of measurement used for impulse noise is dBrnc, where "rn"
stands for reference noise, which is -90 dBm, or 10^-12 W at 1000 Hz.

[Fig. IV-7. C-message weighting curve; horizontal axis: FREQUENCY (Hz), 100 to 5000.]
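The relation between the two reference levels is simple arithmetic, sketched below. The helper names are hypothetical, and the C-message frequency weighting itself is not modeled; only the change of reference (1 mW versus 10^-12 W) is shown.

```python
def dbm_to_watts(dbm):
    """Power in watts for a level expressed in dB relative to 1 milliwatt."""
    return 1e-3 * 10.0 ** (dbm / 10.0)

def dbm_to_dbrn(dbm):
    """Level relative to reference noise (-90 dBm)."""
    return dbm + 90.0

def dbrn_to_dbm(dbrn):
    """Inverse of dbm_to_dbrn."""
    return dbrn - 90.0
```

For example, the 74-dBrnc impulse level listed in Table IV-1 corresponds to -16 dBm before weighting is taken into account.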

In addition to the C-message weighting curve, the device used by the telephone company to
measure Gaussian noise has a built-in mechanism for attenuating impulse noise of a sufficiently
short duration. The argument again is one of ear sensitivity, as the human ear does not fully
perceive the power in noises of less than 200-msec duration.

The telephone company has available another device, called an impulse counter (see Ref. 1,
p. 171), for the purposes of measuring the amount of impulse noise present, as such noise is
far more destructive than Gaussian noise in introducing errors in the transmission of bits
through a modem and a channel. The counter consists of a weighting network, a rectifier, a
threshold detector, and a counter of events above threshold. Several impulse counters can be
used simultaneously at different thresholds to obtain information about the distribution of the
magnitudes of the impulses.
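The counting part of such an instrument can be sketched in a few lines. This is an illustrative model only: the weighting network is omitted, an "event" is assumed to be one upward crossing of the threshold by the rectified signal, and the function name is hypothetical.

```python
def count_impulses(samples, threshold):
    """Count upward threshold crossings of the rectified signal."""
    count = 0
    above = False
    for x in samples:
        level = abs(x)                      # rectifier
        if level > threshold and not above:
            count += 1                      # new event above threshold
        above = level > threshold
    return count

def impulse_distribution(samples, thresholds):
    """Run several counters at different thresholds, as described in the text."""
    return {th: count_impulses(samples, th) for th in thresholds}
```

Running `impulse_distribution` with an increasing list of thresholds gives a coarse picture of how the impulse magnitudes are distributed.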

To measure the nonlinear distortion (see Ref. 1, pp. 238-239), a pure tone at a fixed level
is transmitted, and the amount of energy received at the second and third harmonics is
measured. From these values, one can deduce the coefficients $a_2$ and $a_3$ of the squared and cubed
terms:

    $$\cos^2\theta = \tfrac{1}{2}(1 + \cos 2\theta)$$
    $$\begin{aligned}
    \cos^3\theta = \cos\theta\cos^2\theta &= \tfrac{1}{2}(\cos\theta + \cos\theta\cos 2\theta) \\
                                          &= \tfrac{1}{2}\cos\theta + \tfrac{1}{4}(\cos 3\theta + \cos\theta)
    \end{aligned}$$

Because of this relationship between nonlinearities and harmonics, nonlinear distortion is often
referred to as harmonic distortion.
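The identities above imply that a tone of amplitude A produces a second harmonic of amplitude a2·A²/2 and a third harmonic of amplitude a3·A³/4. The sketch below recovers the two coefficients from those amplitudes. It is an illustration under assumptions not in the report: a memoryless, frequency-independent nonlinearity, positive coefficients, a record holding a whole number of tone periods, and hypothetical function names.

```python
import math

def harmonic_amplitude(signal, harmonic, cycles):
    """Amplitude of one harmonic of a tone that completes `cycles` periods in the record."""
    n = len(signal)
    k = harmonic * cycles
    re = sum(s * math.cos(2.0 * math.pi * k * m / n) for m, s in enumerate(signal))
    im = sum(s * math.sin(2.0 * math.pi * k * m / n) for m, s in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n

def estimate_coefficients(a1, a2, a3, level=0.5, n=512, cycles=8):
    """Pass a pure tone through the power series and deduce a2, a3 from the harmonics."""
    tone = [level * math.cos(2.0 * math.pi * cycles * m / n) for m in range(n)]
    out = [a1 * e + a2 * e * e + a3 * e ** 3 for e in tone]
    h2 = harmonic_amplitude(out, 2, cycles)
    h3 = harmonic_amplitude(out, 3, cycles)
    return 2.0 * h2 / level ** 2, 4.0 * h3 / level ** 3
```

Only amplitudes are measured here, so the signs of the coefficients are not recovered.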

E. SIMPLIFICATIONS USED IN THE SIMULATION

The parameters of the simulation were set so as to emulate, as closely as possible, the
results found from the statistical study. In the case of certain measures, such as frequency
offset and phase jitter, this was a straightforward process. However, in the simulation of
Gaussian noise, impulse hits, phase hits, and nonlinear distortion, certain (sometimes gross)
simplifications were used.

The numbers used for Gaussian noise, impulse noise, and nonlinear distortion in the
simulation were determined empirically by adjusting constants until the desired value was obtained
at the output by the appropriate measuring device (described in Sec. D above).




Gaussian white noise generated at various devices in the transmission line is no longer white
by the time it reaches the receiver because of the filter transfer characteristics. In the
simulation, Gaussian noise is added at the input, so that it will be shaped by the filter of the system.
The level of the Gaussian noise in the telephone system is dependent upon the amplitude of the
input signal, whereas, in the simulation, noise is kept at a fixed level.

Impulse and phase hits are simulated to occur at fixed intervals and at fixed duration and
level, so as to correspond in frequency of occurrence and average amplitude to the given
telephone-channel characteristics.

Nonlinear distortion is simulated by squaring and cubing the final output of the system. The
coefficients $a_2$ and $a_3$ remain fixed and are equal, even though in the actual telephone system
these two coefficients would, in general, be dependent on both amplitude and frequency. The
appropriate gain for the squared and cubed terms was of course determined so as to match the
desired meter measurement. This was done by sending a pure tone at 700 Hz through the
digital distorter and measuring the output of a notched filter (to remove 700 Hz) in dBmc.

F. IMPLEMENTATION ON THE LDVT

The LDVT is a small high-speed computer which has proven capable of realizing in real time
a variety of algorithms for low bit-rate digital-speech transmission (from 2400 to 10,000 bps).
The computer consists of a 1024-word program memory Mp with a cycle time of 55 nsec, a
512-word data memory Md, and a 2048-word outboard memory Mx, an input/output device from
which data are accessible in a few instruction cycles. The LDVT has a very simple instruction
set and is quite convenient to program.

The telephone-channel simulator represents the first attempt to use the LDVT for something
other than a vocoder. A block diagram of the complete simulation is given in Fig. IV-8. The
input speech is low-pass filtered to remove energy above 5 kHz, and sampled at 10 kHz using a
12-bit A/D converter. Impulse and Gaussian noise are generated digitally and added to the
input samples.

The frequency response and delay characteristics of the telephone filter and the Hilbert
transform to obtain the quadrature component are both realized by means of
high-speed-convolution techniques. After a 256-point FFT of the input samples (256 points is adequate as
long as the combined impulse response of the filter and Hilbert transform is less than 12.8 msec),
negative frequency is zeroed out and positive frequency is complex-multiplied by the FFT of the
desired impulse response. A 256-point IFFT now yields the filtered speech $f(t)$ in the real data
buffer, and the filtered quadrature component $\hat{f}(t)$ in the imaginary data buffer (refer to the
previous discussion on SSB modulation in Sec. B above). Only the second half of the data is good,
because of circular convolution; therefore, an overlap of 128 samples is necessary between
frames.
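The frame processing described above can be sketched in floating point as follows. This is a simplified illustration, not the LDVT program: the recursive `fft`, the function names, the explicit factor of 2 applied to the positive frequencies (which could equally be folded into the filter table), and the zeroed DC bin are all assumptions made here.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 FFT; the inverse is left unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1.0 if inverse else -1.0
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def filter_frame(frame, response):
    """Return (f, f_hat) for the second half of one frame.

    `response` holds the filter's frequency response at the positive-frequency bins.
    """
    n = len(frame)
    spectrum = fft([complex(v) for v in frame])
    shaped = [0j] * n                        # negative frequencies (and DC) stay zero
    for k in range(1, n // 2):
        shaped[k] = 2.0 * spectrum[k] * response[k]
    analytic = [v / n for v in fft(shaped, inverse=True)]
    good = analytic[n // 2:]                 # first half is corrupted by circular convolution
    return [v.real for v in good], [v.imag for v in good]

def filter_stream(signal, response, size=256):
    """Process overlapping frames, advancing by half a frame each time."""
    hop = size // 2
    f_out, f_hat_out = [], []
    for start in range(0, len(signal) - size + 1, hop):
        f, f_hat = filter_frame(signal[start:start + size], response)
        f_out += f
        f_hat_out += f_hat
    return f_out, f_hat_out
```

With a flat response and a cosine input, the real output reproduces the cosine and the imaginary output is the corresponding sine, which is the quadrature component.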

The phase $\phi$ is computed as the sum of the contributions of phase jitter, frequency offset,
and phase hits. The output of the "modulator-demodulator" complex is then computed as:

    $$g(n) = f(n)\cos\phi + \hat{f}(n)\sin\phi$$

The final step in the simulation is to subtract from g(n) the nonlinear distortion term:

    $$s(n) = g(n) - D\left[g^2(n) + g^3(n)\right]$$
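The two equations translate directly into a per-sample operation. The sketch below uses floating point and expresses the phase as a fraction of a full turn; the function name is hypothetical.

```python
import math

def quadrature_and_harmonic(f, f_hat, phase_turns, d):
    """Apply quadrature distortion, then subtract the squared-plus-cubed term scaled by d."""
    angle = 2.0 * math.pi * phase_turns      # 1.0 corresponds to 360 degrees
    g = f * math.cos(angle) + f_hat * math.sin(angle)
    return g - d * (g * g + g ** 3)
```

Setting `d` to zero isolates the quadrature distortion; setting `phase_turns` to zero isolates the nonlinear term.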




Fig. IV-8. Block diagram of telephone-channel simulator.


DATA FOR CONUS POOR VOICE

012403  003720  PHSRTE  2000.   PERIOD BETWEEN PHASE HITS IN .01-sec UNITS
012404  000001  PHSDUR  1       DURATION OF PHASE HITS IN .01-sec UNITS
012405  000001  FRQOFF  1       FREQUENCY OFFSET IN Hz
012406  000017  PKTOPK  15.     DEGREES PEAK TO PEAK OF JITTER
012407  000074  JITTER  60.     Hz
012410  000021  PHSHIT  17.     DEGREES
012411  050475  HARM    .6347   HARMONIC DISTORTION FACTOR
012412  001320  GAUSML  .022    NEW SIGMA = GAUSML*.25
012413  000000  IMPRTE  0       PERIOD BETWEEN IMPULSE HITS IN .01-sec UNITS
012414  000000  IMPAMP  .0      AMPLITUDE OF IMPULSE HITS
012415  000001  IMPDUR  1       DURATION OF IMPULSE HITS IN .01-sec UNITS
012416  000000  PKTFRC  .0      FRACTIONAL PART OF PKTOPK
012417  000000  OFFFRC  .0      FRACTIONAL PART OF FRQOFF

Fig. IV-9. Parameters of synthesizer reside in locations 403 to 417
of data memory Md.

There is a special buffer in Md (data memory) which is set aside as the parameters of the
system (excepting filter frequency response) and which can be modified under user control.
Figure IV-9 is a copy of that section of the program, including the appropriate settings for Conus
poor voice. The same mnemonics are used in Fig. IV-8, where it should be clear how the
various parameters are being used.

Table IV-2 gives the values used for the various parameters in the simulation of the eight
channels. The values for the parameters FRQOFF, PKTOPK, JITTER, PHSHIT, PKTFRC,
and OFFFRC are simply copied over from Table IV-1. The values for PHSRTE and IMPRTE
are determined as the reciprocal of No./15 min., and converted from units of minutes to units of
0.01 sec. The parameters PHSDUR and IMPDUR are left unspecified in Table IV-1, and are set
arbitrarily to 0.01 sec in Table IV-2 for all channels. The parameters GAUSML, HARM, and
IMPAMP are dimensionless fractions determined empirically so as to read the correct meter
reading at the output in units of dBmc or dBrnc.
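The conversion from a hit rate to a period in 0.01-sec units is a one-line calculation, sketched here. The function name is hypothetical, and two details are assumptions: rounding to the nearest unit, and using 0 to mean "no hits" (consistent with the IMPRTE entry of 0 shown in Fig. IV-9 for a channel with no impulse noise).

```python
def hits_to_period_units(hits_per_15_min, units_per_second=100):
    """Period between hits, in units of 1/units_per_second seconds."""
    if hits_per_15_min == 0:
        return 0                              # assumed convention: 0 disables the hits
    units_in_15_min = 15 * 60 * units_per_second
    return round(units_in_15_min / hits_per_15_min)
```

For the Conus poor voice channel (45 phase hits per 15 min) this gives 2000, which agrees with the PHSRTE entry in Fig. IV-9.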

The only other information that is needed to simulate each of the channels is the
frequency-domain characteristics of the telephone filter. Tables were given for each channel of 256 real
and 256 imaginary frequency components of the desired filter, spaced by 20 Hz, spanning 0 to
5 kHz. Since the LDVT implementation used a 256- rather than 512-point FFT, alternate samples
were ignored in entering the tables into the computer. It should be noted here that it would have
been essentially impossible to implement a 512-point FFT in the LDVT due to its current memory
limitations, but that the 12.8-msec window appears to be adequate for the filters simulated.

TABLE IV-2

VALUES USED FOR THE PARAMETERS OF THE SIMULATOR TO CORRESPOND TO THE DATA IN TABLE IV-1

[Table body not recoverable.]

The main body of the program is the FFT-IFFT computation which requires several buffers
in Mx (outboard memory), two 128-point buffers in Md, and a complex interchange of data
between Mx and Md. The program takes advantage of certain characteristics of the data to reduce
both time and memory requirements. By keeping the FFT size down to 128, one avoids the
necessity of referencing Mx in the inner loop, which would greatly increase the time required.
Therefore, the forward FFT is realized (since the input data are real) by packing even-numbered
samples in the real Md buffer MDREAL, and odd-numbered samples in the imaginary Md buffer
MDIMAG, and doing a 128-point FFT followed by even-odd separation and a final stage of the
256-point FFT. The first stage of the inverse FFT is nonexistent, since the second half of the
data is zero, and therefore it can be conveniently split into two 128-point FFTs, with the first
one yielding the even-numbered output samples and the second one (by beginning each stage
with the coefficient index set at half its increment instead of at zero) yielding the odd-numbered
output samples.
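The packing trick for the forward transform can be illustrated in floating point. The sketch below is not the LDVT code: it uses a natural-order recursive FFT rather than the in-place, bit-reversed arrangement, and the function names are hypothetical. It shows how one N-point complex FFT, an even-odd separation, and one final stage give the first N bins of the 2N-point transform of real data.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 forward FFT."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def real_fft_first_half(samples):
    """First N bins of the 2N-point DFT of 2N real samples, via one N-point complex FFT."""
    n = len(samples) // 2
    # Even-numbered samples in the real part, odd-numbered samples in the imaginary part.
    packed = [complex(samples[2 * m], samples[2 * m + 1]) for m in range(n)]
    z = fft(packed)
    out = []
    for k in range(n):
        zc = z[(n - k) % n].conjugate()
        even = (z[k] + zc) / 2.0              # transform of the even-numbered samples
        odd = (z[k] - zc) / 2j                # transform of the odd-numbered samples
        out.append(even + cmath.exp(-1j * math.pi * k / n) * odd)   # final stage
    return out
```

Only the first N bins are produced, which is all that is needed when the negative frequencies are about to be zeroed.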

Because of this even-, odd-numbered sorting at both the output and input, it is convenient
to store all buffers of speech in Mx in this peculiar fashion of even-numbered samples in one
buffer and odd-numbered samples in another. The buffers required in Mx are listed in
Table IV-4. The even-numbered input samples are stored in EVIO and the odd-numbered in
ODIO. At each A/D interrupt, the next sample is fetched out of either EVIO or ODIO alternately
and sent to the D/A converters, and the new sample from the A/D converter is written over the
same place in Mx. At the beginning of a new frame, the most recent 128 samples from the A/D
converter have filled up EVIO and ODIO, and so these buffers now become EVNEW and ODNEW,
respectively. At the same time, the former EVNEW and ODNEW become EVOLD and ODOLD,
and the former EVOLD and ODOLD, whose contents had been written over by s(n) in the course
of the previous frame, become EVIO and ODIO, ready to be sent to the D/A converter.
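The rotation of buffer roles can be modeled with three pairs of lists whose references are swapped once per frame. This is an illustrative sketch only; the class and method names are hypothetical, and Python lists stand in for regions of Mx.

```python
class FrameBuffers:
    """Three even/odd buffer pairs whose roles rotate once per frame."""

    def __init__(self, half_frame=64):
        self.half_frame = half_frame
        self.io = ([0.0] * half_frame, [0.0] * half_frame)    # EVIO, ODIO
        self.new = ([0.0] * half_frame, [0.0] * half_frame)   # EVNEW, ODNEW
        self.old = ([0.0] * half_frame, [0.0] * half_frame)   # EVOLD, ODOLD

    def interrupt(self, index, sample_in):
        """Return the output sample stored at `index`, then overwrite it with the new input."""
        bank = self.io[index % 2]             # even indices use EVIO, odd indices use ODIO
        slot = index // 2
        sample_out = bank[slot]
        bank[slot] = sample_in
        return sample_out

    def rotate(self):
        """IO becomes NEW, NEW becomes OLD, OLD becomes IO; only references move."""
        self.io, self.new, self.old = self.old, self.io, self.new
```

After three rotations each pair is back in its original role, which mirrors the "full circle" described for the frame loop.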

The first step in a new frame is to fill MDREAL with 128 even-numbered new samples from
EVOLD/EVNEW and to fill MDIMAG with the odd-numbered samples from ODOLD/ODNEW.
Now a 128-point normal to bit-reversed decimation-in-frequency FFT followed by bit-reversed
even-odd separation and a final stage of the 256-point FFT can all be realized in place in Md.
The resulting data (the first 128 samples of the 256-point FFT) are then complex-multiplied by
the bit-reversed filter coefficients (which have been stored in Mx). The resulting filtered
spectrum is then saved in Mx; the real data are stored over ODOLD/EVOLD, and the imaginary
data are saved in a 128-point buffer, MXIMAG. On the data in Md, a 128-point IFFT at this
point yields the even-numbered $f(n)$ samples in MDREAL and the even-numbered $\hat{f}(n)$ (quadrature
component) samples in MDIMAG.
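The split of the inverse transform into two half-size transforms can be sketched as follows. Note the substitution made here: instead of offsetting the coefficient index inside each stage, as the report describes, this sketch rotates the spectrum by a half-bin phase before an ordinary N-point inverse FFT, which yields the same odd-numbered samples. Function names are hypothetical.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 FFT; the inverse is left unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1.0 if inverse else -1.0
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def split_inverse(spectrum_half):
    """Inverse 2N-point DFT of a spectrum whose upper N bins are zero.

    Returns (even_samples, odd_samples), each of length N.
    """
    n = len(spectrum_half)
    evens = [v / (2 * n) for v in fft(list(spectrum_half), inverse=True)]
    # A half-bin phase rotation shifts the output by one sample of the 2N-point grid.
    rotated = [v * cmath.exp(1j * math.pi * k / n) for k, v in enumerate(spectrum_half)]
    odds = [v / (2 * n) for v in fft(rotated, inverse=True)]
    return evens, odds
```

The real and imaginary parts of each returned list correspond to the in-phase and quadrature samples, respectively.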

Only the second half of the data is good, because of circular convolution, and these 64
even-numbered $f(n)$ and even-numbered $\hat{f}(n)$ are saved in Mx. Meanwhile, the original real and
imaginary outputs of the forward FFT filtered spectrum are fetched back from Mx and another
128-point IFFT, beginning each stage with the coefficient index set at half its increment instead
of at 0, yields the odd-numbered samples of $f(n)$ and of $\hat{f}(n)$ in Md. Again, the first 64 of each
are garbage. The 64 even-numbered samples of $f(n)$ and of $\hat{f}(n)$ can now be fetched back from
Mx and stored over the 64 garbage points in each of MDREAL and MDIMAG. Now, all the
pertinent data reside in Md and all that is left is the introduction of quadrature distortion and
nonlinear distortion:

    $$g(n) = f(n)\cos\phi + \hat{f}(n)\sin\phi$$
    $$s(n) = g(n) - D\left[g^2(n) + g^3(n)\right]$$

This can be done conveniently at this time, and the resulting s(n) evens can be stored over
EVOLD, and the s(n) odds over ODOLD.

When the interrupt routine has been serviced 128 times, resulting in 64 new samples in each
of EVIO and ODIO, it is time to swap buffer pointers again, leaving s(n) evens in EVIO and s(n)
odds in ODIO, as desired, and we have come full circle.

The assurance of adequate accuracy without overflow in the FFT-IFFT complex at first posed
some difficulty, as a check for overflow and correction in the inner loop proves very costly in
terms of time. The scheme finally decided upon was:




(1) Shift the input buffer of the forward FFT in Md as far left as possible
as a block, so that the largest sample does not overflow.

(2) Scale down the data as a block by 1/2 at each stage of the inner loop
of the forward FFT.

(3) Scale the output data in Md from the forward FFT as far left as
possible.

(4a) If the data have been scaled up by more than 2^7, scale them down by
1/2 at each stage of the IFFT and scale them down the remaining
amount at the end.

(4b) If the data have been scaled up by less than 2^7, scale them down by
1/2 at each stage of the IFFT until such time as no more scaling is
needed.
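The bookkeeping behind steps (1) through (4b) can be sketched with two small helpers. This is an illustration under stated assumptions, not the LDVT routine: a 16-bit signed word, a 128-point IFFT with 7 stages (so that 2^7 is the most that per-stage halving can remove), and hypothetical function names.

```python
def headroom_shift(block, word_bits=16):
    """Largest left shift that keeps every integer sample inside a signed word."""
    limit = 1 << (word_bits - 1)
    peak = max(abs(v) for v in block)
    shift = 0
    while peak and (peak << (shift + 1)) < limit:
        shift += 1
    return shift

def ifft_scaling_plan(net_up_shift, stages=7):
    """Right shift to apply at each IFFT stage, plus the shift left over for the end.

    `net_up_shift` is the power of two by which the data currently exceed true scale.
    """
    per_stage = [1 if s < net_up_shift else 0 for s in range(stages)]
    remaining = max(0, net_up_shift - stages)
    return per_stage, remaining
```

A plan with a nonzero `remaining` corresponds to case (4a); a plan that stops halving partway through the stages corresponds to case (4b).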

The remainder of the program is reasonably straightforward. Gaussian and impulse noise
are added to a new s(n) in the interrupt routine, as soon as it comes in from the A/D converter
and before it is sent to Mx. Gaussian noise is simulated by first generating a 9-bit
pseudo-random number with a flat distribution. Eight bits are used as a pointer into a table of the
midpoints of 256 equal areas over the positive half of a Gaussian distribution, and the ninth bit
determines the sign. The numbers in the table were chosen such that σ = 0.25, and values (when
fetched from the table) are multiplied by the appropriate constant GAUSML to obtain the desired
σ. Impulse noise is created by adding to each s(n) a signal of height IMPAMP for the duration
of time IMPDUR, and at a periodic rate IMPRTE.
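A table of this kind can be built and used as sketched below. Two points are assumptions made for the illustration: "midpoint" is taken to mean the point that splits each equal-area slice into two equal probabilities, and Python's standard random generator stands in for the 9-bit pseudo-random source. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def build_table(entries=256, sigma=0.25):
    """Midpoints (in probability) of `entries` equal areas over the positive half."""
    dist = NormalDist(0.0, sigma)
    return [dist.inv_cdf(0.5 + (i + 0.5) / (2 * entries)) for i in range(entries)]

def table_rms(table):
    """RMS of the table, which is the standard deviation of the generated noise at scale 1."""
    return math.sqrt(sum(v * v for v in table) / len(table))

def gaussian_sample(table, gausml, rng=random):
    """Eight bits index the table, the ninth bit chooses the sign, GAUSML scales the result."""
    bits = rng.getrandbits(9)
    magnitude = table[bits & 0xFF]
    sign = -1.0 if bits & 0x100 else 1.0
    return sign * magnitude * gausml
```

Because the largest table entry is finite, the generated noise has truncated tails, so its RMS falls slightly below the nominal 0.25 before scaling.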

The phase $\phi$ and the components $\phi_{\Delta\omega}$, $\phi_j$, and $\phi_h$ are all stored in the computer as fractions,
with 1.0 corresponding to 360°. $\phi_{\Delta\omega}$ is updated at each s(n) by adding FRQOFF/10,000, since an
offset of 1 Hz would be 360° in 10,000 samples. Likewise, $\omega_j t$ is updated by adding JITTER/
10,000 at each s(n). The cosine of $\omega_j t$ is multiplied by PKTOPK/360 (expressing degrees
peak-to-peak as a fraction) to obtain $\phi_j$. Phase hits are simulated by adding PHSHIT/360 to $\phi$ for
those samples s(n) during the interval PHSDUR, and spaced by the time PHSRTE. The three
values $\cos\omega_j t$, $\cos\phi$, and $\sin\phi$ are determined by table lookup from Mx using the same
first-quadrant 64-point cosine table as is used for the FFTs, plus an additional interpolation angle of
90°/128 whose cosine and sine are stored in Md. From these it is possible to determine the
cosine and sine of any angle from 0° to 360° to the accuracy of 90°/128 using the formulas:

    $$\cos(a + b) = \cos a\cos b - \sin a\sin b$$
    $$\sin(a + b) = \cos a\sin b + \cos b\sin a$$
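The lookup scheme can be illustrated as follows. The table layout is an assumption made for this sketch (64 cosine entries spaced 90°/64 apart, with sines read from the same table in reverse); the names are hypothetical. The angle-sum formulas above supply the extra half step of 90°/128.

```python
import math

STEP = (math.pi / 2.0) / 64                       # table spacing: 90/64 degrees
COS_TABLE = [math.cos(i * STEP) for i in range(64)]
COS_HALF = math.cos(STEP / 2.0)                   # cosine of the 90/128-degree angle
SIN_HALF = math.sin(STEP / 2.0)                   # sine of the 90/128-degree angle

def first_quadrant(i):
    """Cosine and sine of i table steps, 0 <= i <= 63, from the cosine table alone."""
    c = COS_TABLE[i]
    s = 0.0 if i == 0 else COS_TABLE[64 - i]      # sin(x) = cos(90 degrees - x)
    return c, s

def cos_sin(phase_turns):
    """Cosine and sine of a phase given as a fraction of a turn, quantized to 90/128 degrees."""
    idx = int(round(phase_turns * 512)) % 512     # 512 steps of 90/128 degrees per turn
    quadrant, r = divmod(idx, 128)
    c, s = first_quadrant(r // 2)
    if r % 2:                                     # add the half step with the angle-sum formulas
        c, s = c * COS_HALF - s * SIN_HALF, c * SIN_HALF + s * COS_HALF
    for _ in range(quadrant):                     # each quadrant adds 90 degrees
        c, s = -s, c
    return c, s
```

The phase is first rounded to the nearest multiple of 90°/128, so the returned values are exact for that quantized angle to within floating-point rounding.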

Harmonic distortion is added just before the final adjustment of the data from block floating
to fixed point so as to gain bits in the cubing and squaring of the data. The terms $g^2(n)$ and $g^3(n)$
are added together, multiplied by the parameter HARM, and subtracted from g(n) to obtain s(n)
which, when converted back to fixed point, is the final output of the system.

A somewhat intuitive feel for the effects of phase jitter and frequency offset can be gained
by referring to Fig. IV-10(a-c). This figure shows a time exposure of the impulse response of
the telephone filter for Conus poor voice with (a) all the other distortions removed from the
system, (b) only phase jitter added, and (c) only frequency offset added. It can be noted that, with
a phase-jitter interference, the samples "jitter" about a focal point, whereas with a
frequency offset there is a smooth motion of the wave as it appears to continually pass through and
disappear.

Fig. IV-10. Time exposure of unit sample response of filter for Conus poor voice line.
(a) Unit sample response; (b) same as in (a) but with phase jitter at 17° peak-to-peak,
60 Hz; (c) same as in (a) but with frequency offset between transmitter cosine and
receiver cosine of 1 Hz.

G. LDVT TIMES AND SPACE

Table IV-3 is a list of the various subroutines in the LDVT in their approximate order of
occurrence. Included for each subroutine are the amount of program memory required and the
amount of time required for its execution. The total time needed per frame is slightly more
than half the time available, and the program uses up 77 percent of Mp.

There are only 512 locations in Md, of which half are needed for the 128-point FFT real
and imaginary buffers. Fortunately, no other large buffers are needed by the program, so that
an additional 125 locations were needed for the various parameters, temporaries, and variables,
leaving one-fourth of Md unused.

As for outboard memory Mx, 256 locations are needed for the filter coefficients, 256 for
the Gaussian table, 128 for the cosine tables, and 4 X 128 for the various speech buffers, for a
total of 1152 or 56 percent of Mx (Table IV-4).


TABLE IV-3

[Program subroutines, with time and program memory required; title truncated and table body not recoverable.]


TABLE IV-4

BUFFERS IN OUTBOARD MEMORY Mx NEEDED FOR TELEPHONE SIMULATION

Buffer   Allocation                                                         Buffer Size
-------  -----------------------------------------------------------------  -------------------
EVIO     Even-numbered in-out samples                                       64
ODIO     Odd-numbered in-out samples                                        64
EVNEW    Most recent 64 even-numbered samples of s(n)                       64
ODNEW    Most recent 64 odd-numbered samples of s(n)                        64
EVOLD    Previous 64 even-numbered s(n) (also as Mx Real)                   64
ODOLD    Previous 64 odd-numbered s(n) (also as Mx Real)                    64
MXIMAG   Temporary storage of imaginary FFT output                          128
GAUS     Midpoints of 256 equal areas of positive half of Gaussian          256
FLTR     Bit-reversed filter coefficients (128 Real, 128 Imaginary)         256
SINE     Sine of 0 to π/2 for FFTs and for computation of cosine φ          64
RVSINE   Bit-reversed sine of 0 to π/2 for last stage of forward FFT        64
Total                                                                       1152 = 56 percent

REFERENCES

1. Transmission Systems for Communications, 4th Edition,
Bell Telephone Laboratories (1970).

2. J. M. Wozencraft and M. Jacobs, Principles of Communication
Engineering (Wiley, New York, 1967), pp. 504-508.

3. M. Schwartz, Information Transmission, Modulation, and
Noise (McGraw-Hill, New York, 1970), pp. 218-228.

4. B. Gold, A. V. Oppenheim, and C. M. Rader, "Theory and
Implementation of the Discrete Hilbert Transform," Symposium
on Computer Processing in Communications, Polytechnic
Institute of Brooklyn (Polytechnic Press, 1970), pp. 235-250.

5. Captain R. Lemon, private communication.

6. P. E. Blankenship et al., "The Lincoln Digital Voice Terminal
System," Technical Note 1975-53, Lincoln Laboratory,
M.I.T. (25 August 1975), DDC AD-A017569/5.

7. B. Gold and L. R. Rabiner, Theory and Application of Digital
Signal Processing (Prentice-Hall, New York, 1975).




V. THE HARMONIC PITCH DETECTOR


A. INTRODUCTION

All pitch detectors can be placed in one of two categories: time domain and frequency
domain. Time-domain pitch detectors deal directly with the speech waveform, and as such
are relatively fast, since very little preprocessing of the signal is required. Most
frequency-domain detectors require an abundance of time and memory storage to obtain the spectral
information over a sufficiently long time window and with adequate spectral resolution. These
methods are therefore often not realizable in real-time vocoder implementations, or are only
realizable at the cost of excessive quantization of the pitch.

Pitch detection is generally good when the input signal is intact and noise-free. However,
distortions, filters, and noise tend to obscure the pitch information and cause most pitch
detectors to break down, sometimes severely. Since in the real world the signal is often
corrupted, we felt that an algorithm designed to be robust against degradations would be a
significant new contribution.

We were particularly interested in coping with degradations caused by (1) passage of the
speech through the public telephone system prior to pitch detection, and (2) acoustically coupled
noise backgrounds. From a previous effort,¹ we had the capability to simulate in real time the
filtering, phase distortion, phase jitter, and nonlinear distortion effects of a telephone system.
In addition, we had available test material wherein the noise background of a large jet airplane
was incorporated into the recording.

The algorithm thus developed is a frequency-domain technique which, however, restricts
itself to a selected portion of the frequency band below 1100 Hz. Digital-signal-processing
tricks were used to obtain the desired spectral region with minimal computation time. Pitch
is determined from spacing between peaks in this region, using an iterative method. The
buzz-hiss decision makes use of none of the standard indicators such as energy ratios and zero
crossing density, as these parameters are highly susceptible to noise and distortion. Instead,
continuity of the pitch track is the only parameter used to determine voicing, other than a very
conservative silence threshold. The algorithm has been incorporated into a real-time
linear-prediction vocoder implemented on the Lincoln Digital Voice Terminal (LDVT).²

B.  PREPROCESSING 

In order to obtain an accurate pitch estimate from the spectral information, it is necessary
to begin with a spectrum with good frequency resolution, but one which spans a sufficient block
of the frequency space for there to be at least two harmonics present over the range available.
Since a pitch value of 350 Hz is not unreasonable for a female, the spectral region to be analyzed
must be at least 700 Hz wide. Within this range, one can arbitrarily choose an FFT size to
yield the desired frequency spacing between samples at the expense of computer time. We
decided to compute a 128-point FFT to yield a spectrum spanning 840 Hz with a resulting
frequency spacing between samples of 6.6 Hz, which appears to be adequate resolution for our
purposes.

Since the pitch information must be extracted from precisely the 840-Hz region chosen, it
is expedient to carefully select that region which is most likely to yield robust harmonics. Since
the telephone filter removes the signal below about 300 Hz, one need not waste space on the low
end of the frequency spectrum. However, as one advances to increasingly higher frequencies,
the spectrum becomes more and more ragged, and the harmonics increasingly difficult to
extract.

Fig. V-1. Block diagram of harmonic pitch detector.

Fig. V-2. Preprocessing of speech waveform to obtain downsampled signal with desired
spectral information: (1) LP filter and downsample 3/1; (2) shift spectrum, multiplying
each s(n) by a complex exponential; (3) LP filter and downsample 3/1.

The region selected was 210 to 1050 Hz. These particular numbers were arrived at in
large part because of appropriate tricks that could be used to extract precisely this piece of
the spectrum. The original speech waveform was analog filtered and sampled at 132-µsec
intervals, yielding a signal containing frequencies up to 3780 Hz, which could be used as input
both to the linear-prediction analysis and to the pitch extractor.

The first step in the pitch extraction is to filter the speech down to 1260 Hz and downsample,
discarding two-out-of-every-three samples (Fig. V-1). For this purpose, a finite impulse
response (FIR) filter seemed to be a good choice. Since FIR filters have only zeros, one need
compute outputs only at the downsampled rate, which in our case represents a three-to-one
savings in time. Furthermore, FIR filters are implementable using charge-coupled devices
(CCDs), a potentially fast and inexpensive computational source.
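
The saving from computing outputs only at the downsampled rate can be sketched as follows.
This is a minimal illustration, not the LDVT code: the function names are hypothetical, and
the 5-tap moving-average coefficients in the usage lines are an assumption, since the report
does not list the filter taps.

```python
def fir_decimate(x, h, factor=3):
    """FIR-filter x with taps h, evaluating only every `factor`-th output sample."""
    out = []
    for n in range(0, len(x), factor):
        acc = 0.0
        for k, tap in enumerate(h):
            if n - k >= 0:          # samples before the start of x are taken as zero
                acc += tap * x[n - k]
        out.append(acc)
    return out


def fir_full(x, h):
    """Reference version: evaluate every output sample, for comparison only."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


# Usage: one frame of 153 input samples yields 51 output samples.
frame = [float((7 * n) % 11) for n in range(153)]
taps = [0.2] * 5                    # assumed placeholder coefficients
decimated = fir_decimate(frame, taps)
print(len(decimated))
```

Because an FIR output depends only on input samples, the skipped outputs are never needed,
which is where the three-to-one reduction in multiply-accumulate work comes from.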

We now have a waveform s1(n) which contains information from -1260 to +1260 Hz. We
know that, since the waveform is real, the negative frequency information is redundant.
Digital-signal-processing theory tells us that if we multiply each sample of the waveform s1(n) by
$e^{j\omega n}$, we will cause the spectrum to be rotated by $\omega$ in the z-plane. By choosing
$\omega = 90^\circ$, we cause the spectrum to be rotated such that 1260/2 = 630 Hz is at the origin
(Fig. V-2). Now a second pass of this complex s2(n) through the same filter, with 3-to-1
downsampling again, will yield a complex waveform s3(n) containing frequencies up to
1260/3 = 420 Hz. However, because of the rotated spectrum, 630 Hz in the original waveform
corresponds to 0 Hz in s2(n), and thus our doubly downsampled complex waveform contains the
information from 630 - 420 Hz to 630 + 420 Hz in the original speech, which is the desired
spectral region.

Choosing $\omega = 90^\circ$ has certain advantages in terms of speed. Multiplication by
$e^{j\omega n}$ involves only data transfer rather than complex multiplies, since the sine and
cosine of multiples of 90° are always either ±1 or 0. Furthermore, as a consequence, each sample
of s2(n) is either purely real or purely imaginary. One can therefore use simple tricks in the
implementation of the FIR filter so that filtering of this complex waveform takes essentially no
more time than would filtering a real waveform.
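
A minimal sketch of the multiplier-free rotation follows. The function name is hypothetical,
and the sign of the exponent is an assumption: the code uses a downward shift,
exp(-j*pi*n/2), which brings +630 Hz to the origin. The opposite sign brings the mirror-image
band at -630 Hz to the origin, which carries the same information for a real input.

```python
import cmath
import math


def shift_quarter_rate(x):
    """Multiply real samples x[n] by exp(-j*pi*n/2) using only sign changes and moves.

    The factor cycles through 1, -j, -1, +j, so no multiplications are needed and
    every output sample is purely real or purely imaginary.
    """
    out = []
    for n, v in enumerate(x):
        phase = n % 4
        if phase == 0:
            out.append(complex(v, 0.0))     # times  1
        elif phase == 1:
            out.append(complex(0.0, -v))    # times -j
        elif phase == 2:
            out.append(complex(-v, 0.0))    # times -1
        else:
            out.append(complex(0.0, v))     # times +j
    return out


# Usage: a tone at one quarter of the sampling rate (630 Hz when the rate is
# 2520 Hz) acquires a DC component after the shift.
tone = [math.cos(math.pi * n / 2) for n in range(8)]
shifted = shift_quarter_rate(tone)
print(abs(sum(shifted)) / len(shifted))
```

Since each shifted sample has only one nonzero part, a following FIR pass needs one real
multiply per tap rather than a full complex multiply.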

Pitch detection generally requires a long time window of speech in order to assure at least
two periods of a low pitched voice. Fortunately, the doubly downsampled signal consists of
samples which are spaced by 132 × 3 × 3 µsec, or 1.188 msec. Only 32 samples of this
waveform are required to yield 38 msec of data, a time window that is sufficient to encompass two
periods for pitches of up to 19 msec, or 53 Hz, a very deep male voice.
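
The figures in the preceding paragraph follow from simple arithmetic, reproduced here as a
check. The variable names are hypothetical; the numbers are those quoted in the text.

```python
SAMPLE_INTERVAL_US = 132      # input sampling interval in microseconds
DECIMATION = 3 * 3            # two successive 3-to-1 downsampling passes
WINDOW_SAMPLES = 32           # samples of s3(n) used per analysis window

spacing_ms = SAMPLE_INTERVAL_US * DECIMATION / 1000.0   # spacing of s3(n) samples
window_ms = WINDOW_SAMPLES * spacing_ms                 # length of analysis window
longest_period_ms = window_ms / 2.0                     # two periods must fit
lowest_pitch_hz = 1000.0 / longest_period_ms            # lowest pitch covered

print(spacing_ms, window_ms, longest_period_ms, lowest_pitch_hz)
```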

The 32 most recent samples of s3(n) are windowed using a standard Hanning window and then
filled out with zeros to make a 128-point input buffer for the FFT. Because three-fourths of the
input samples are zero, the FFT computation time can be reduced by essentially skipping the
first two stages. The resulting spectrum contains the information in the original speech signal
from 210 to 1050 Hz, as desired, and is ready, after the computation of the magnitude spectrum
from real and imaginary components, to be processed for harmonic detection.
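
The windowing, zero padding, and magnitude computation can be sketched as below. This is an
illustration under stated assumptions, not the LDVT routine: it uses a direct DFT rather than
a pruned FFT, a symmetric Hanning window, and a bin-to-frequency mapping that assumes 630 Hz
was shifted down to 0 Hz. All function and constant names are hypothetical.

```python
import cmath
import math

FFT_SIZE = 128
SPAN_HZ = 840.0        # width of the analysed band
CENTER_HZ = 630.0      # original-speech frequency assumed to sit at 0 Hz in s3(n)


def hanning(n_points):
    """Symmetric Hanning window (assumed form; the report says only 'standard')."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n_points - 1)) for i in range(n_points)]


def magnitude_spectrum(s3_recent):
    """Magnitude of the 128-point DFT of the windowed, zero-padded input."""
    windowed = [v * w for v, w in zip(s3_recent, hanning(len(s3_recent)))]
    spectrum = []
    for k in range(FFT_SIZE):
        acc = 0j
        # The padded zeros contribute nothing, so only the nonzero samples are summed.
        for n, v in enumerate(windowed):
            acc += v * cmath.exp(-2j * math.pi * k * n / FFT_SIZE)
        spectrum.append(abs(acc))
    return spectrum


def bin_to_hz(k):
    """Map DFT bin k to a frequency in the original speech."""
    signed = k if k < FFT_SIZE // 2 else k - FFT_SIZE
    return CENTER_HZ + signed * SPAN_HZ / FFT_SIZE


# Usage: a complex tone 105 Hz above the band centre peaks in bin 16 (735 Hz).
tone = [cmath.exp(2j * math.pi * n / 8) for n in range(32)]
spec = magnitude_spectrum(tone)
peak_bin = max(range(FFT_SIZE), key=lambda k: spec[k])
print(peak_bin, bin_to_hz(peak_bin))
```

Zero padding does not add information; it interpolates the spectrum so that peak locations
can be read on a grid of 840/128 Hz.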

C. PEAK PICKING ALGORITHM

The self-normalized magnitude spectrum obtained from the windowed s3(n) is generally a
very smooth function with peaks only at the harmonics of the pitch. The peaks are of unequal
size, the larger ones showing up at the resonance of the first formant. In the case of the
phoneme /i/ for example, a vowel with an extremely low F1 frequency, the first harmonic is
generally very large compared with all the others. On the other hand, the back vowel /a/
generally has a more graceful bulge in the high end of the spectrum, with the largest peak near
800 Hz or so (Fig. V-3).

The variability in size of peaks would not be a problem if there were never any spurious
peaks. Unfortunately, such is not the case, for the speech waveform never behaves in any
guaranteed fashion. A common problem is the presence of subharmonic peaks in the spectrum
half way between the true harmonics, possibly caused by irregularities in the laryngeal
excitation. These are nearly always smaller than their neighbors, but they may very well not be
smaller than other true harmonics not at the formant resonance. Thus, a simple measure of
distance between peaks above a fixed threshold may yield a better score for a pitch choice in
hertz of half the true value. A further serious problem with telephone speech is that the carrier
cosine often contains 60-Hz interference which shows up as 60-Hz modulation of the speech
waveform. The consequence of such interference is spurious peaks on either side of a large
peak, 60 Hz away. These are often larger than true harmonics not at the formant resonance
(Fig. V-4).


Fig. V-3. Typical first formant region spectra for vowels /i/ and /a/.

Fig. V-4. Introduction of spurious peak in spectrum as consequence of 60-Hz phase jitter.


Another fact which increases the difficulty of pitch detection is the wide variability in the
number of peaks to expect to find. For a high-pitched female voice, there are often only two
peaks which should even be considered, and the pitch is the distance between them. For an
80-Hz male voice, on the other hand, one expects to find at least 10 peaks at the harmonics of
the pitch. An algorithm has to recognize the fact that there may be only two valid peaks, yet
most of the time it should consider far more than two peaks in making a decision.

The algorithm described here uses an iterative technique which begins by considering only
the two largest peaks. It then adds each peak in turn, from largest to smallest, and after the
addition of each new peak determines a new list of potential pitches as the distance between
adjacent peaks under consideration. Such a technique results in a built-in weighting mechanism,
whereby the largest peak is included in every iteration, but the smallest only in the last. The
final decision algorithm determines the pitch from a list which includes all the estimates from
each iteration.

The first step in extracting the pitch is to find all peaks in the spectrum and to eliminate
from consideration those which are judged to be spurious. For each peak, an amplitude and a
frequency location are determined. The location is defined simply as the frequency at which the
actual peak occurs. The amplitude is defined not as the magnitude of the sample at the peak,
but rather as the "area under the hump." That is to say, the amplitude of a given peak is the
non-normalized sum of the amplitudes of all the samples from the previous valley to the
following valley. In the event that the sum overflows 16 bits, it is clamped at ≈ 1. This choice of
definition was found to effect a better separation between true peaks and spurious peaks than
would a simple amplitude at the peak.

Peaks are eliminated from consideration if they are too small and/or too close to a
neighboring peak. Specifically, a peak is removed if its location is within 6 samples (40 Hz) of a
larger neighboring peak. A peak which is more than 6 but fewer than 10 samples away from
its nearest neighbor is removed if its amplitude is less than 1/2 the amplitude of the near
neighbor.
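
The peak description and the elimination rules above can be sketched as follows. The
function names, the `max_area` clamp parameter, and the decision to test every peak against
the unpruned list in a single pass are assumptions made for illustration; the report does not
state the order in which removals are applied.

```python
def find_peaks(spec, max_area=None):
    """Return (location, amplitude) pairs, amplitude being the 'area under the hump'."""
    peaks = []
    for i in range(1, len(spec) - 1):
        if spec[i - 1] < spec[i] >= spec[i + 1]:
            lo = i
            while lo > 0 and spec[lo - 1] < spec[lo]:
                lo -= 1                      # walk down to the previous valley
            hi = i
            while hi < len(spec) - 1 and spec[hi + 1] < spec[hi]:
                hi += 1                      # walk down to the following valley
            area = sum(spec[lo:hi + 1])
            if max_area is not None:
                area = min(area, max_area)   # stand-in for the 16-bit overflow clamp
            peaks.append((i, area))
    return peaks


def prune_peaks(peaks, near=6, far=10):
    """Drop peaks that are too close to, or too small beside, a neighboring peak."""
    kept = []
    for idx, (loc, amp) in enumerate(peaks):
        neighbours = [peaks[j] for j in (idx - 1, idx + 1) if 0 <= j < len(peaks)]
        # Rule 1: within `near` samples of a larger neighboring peak.
        drop = any(abs(nloc - loc) <= near and namp > amp for nloc, namp in neighbours)
        # Rule 2: nearest neighbor is between `near` and `far` samples away
        # and this peak has less than half its amplitude.
        if neighbours and not drop:
            nloc, namp = min(neighbours, key=lambda p: abs(p[0] - loc))
            if near < abs(nloc - loc) < far and amp < 0.5 * namp:
                drop = True
        if not drop:
            kept.append((loc, amp))
    return kept


# Usage: a small hump 4 samples from a large one is discarded.
spec = [0.0] * 40
for i, v in {8: 1, 9: 3, 10: 5, 11: 3, 12: 1, 13: 0.5, 14: 2, 15: 0.5,
             29: 2, 30: 4, 31: 2}.items():
    spec[i] = v
print(prune_peaks(find_peaks(spec)))
```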

The peaks that remain after the elimination step are given a rank order according to size.
At the first iteration, a single pitch estimate is entered into a table of potential pitch estimates,
defined as the distance between the two largest peaks. At the second iteration, the third largest
peak is added to the list of peaks under consideration and two new pitch estimates are added to
the table, defined as the distance between adjacent peaks, among the three under consideration.
At each subsequent i-th iteration, the largest peak among those remaining is added to the list of
candidate peaks and i new pitch estimates are added to the growing list of estimates, defined
again as distance between adjacent peaks (Fig. V-5).

Pitch estimates are always added to the table in order, with the smallest at the beginning
of the table. After each iteration, a score is computed for the maximum number of consecutive
"equal" pitch estimates in the table. (Equal is defined as within 14 Hz of the succeeding entry
in the table.)

As soon as there are at least 6 "equal" estimates, the average value for the "equal" entries
is defined as the pitch (in Hz). If there are fewer than 6 "equal" estimates, then the algorithm
continues with the next iteration until the size of the next available leftover peak is less than
1/10 the size of the largest peak, or until a maximum of 7 peaks has been exhausted. If either
of these conditions is met, the algorithm exits in spite of an inadequate score, and chooses as
the pitch value the average of the longest string of "equal" estimates. In the case of a tie
between two strings, the one with the larger pitch estimate is arbitrarily defined to be the pitch.
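
The iterative scoring described above can be sketched as follows. This is a reading of the
text rather than the LDVT program: peaks are assumed to be (bin location, amplitude) pairs,
the bin spacing of 840/128 Hz comes from the preprocessing section, and the function and
parameter names are hypothetical.

```python
def longest_equal_run(table, tol_hz=14.0):
    """Longest run of sorted entries, each within tol_hz of the next; ties go to the larger."""
    best = run = [table[0]]
    for prev, cur in zip(table, table[1:]):
        run = run + [cur] if cur - prev <= tol_hz else [cur]
        if len(run) >= len(best):       # '>=' lets a later (larger) run win a tie
            best = run
    return best


def harmonic_pitch(peaks, bin_hz=840.0 / 128, need=6, max_peaks=7, floor_ratio=0.1):
    """Estimate pitch in Hz from (bin, amplitude) peaks, or None if fewer than two peaks."""
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks]
    if len(ranked) < 2:
        return None
    active = sorted(p[0] for p in ranked[:2])   # start with the two largest peaks
    table = []
    next_rank = 2
    while True:
        # Add the distances between adjacent peaks currently under consideration.
        table.extend((b - a) * bin_hz for a, b in zip(active, active[1:]))
        table.sort()
        run = longest_equal_run(table)
        if len(run) >= need:
            break
        out_of_peaks = next_rank >= len(ranked)
        if out_of_peaks or ranked[next_rank][1] < floor_ratio * ranked[0][1]:
            break
        active.append(ranked[next_rank][0])
        active.sort()
        next_rank += 1
    return sum(run) / len(run)


# Usage: five harmonics spaced 16 bins (105 Hz) apart.
peaks = [(32, 10.0), (16, 8.0), (48, 6.0), (64, 5.0), (0, 4.0)]
print(harmonic_pitch(peaks))
```

Because the table accumulates estimates from every iteration, spacings that involve the
largest peaks are counted several times, which is the built-in weighting the text describes.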


Fig. V-5. Illustration of iterative scoring algorithm under artificially
adverse conditions.

Fig. V-6. Median smoothing filter (a) for bit stream, and (b) for function;
panel (b) uses the number sequence 5, 6, 12, 7, 8.


This harmonic-detection algorithm is run twice per 20-msec frame on spectra of data
spaced by 10-msec intervals. The output is thus an oversampled, unsmoothed pitch contour,
and the final step in the processing is to make the buzz-hiss decision and decide a single pitch
value for each frame. For this purpose, the pitch contour is passed through a 3-point followed
by a 5-point median smoothing filter³ [Fig. V-6(a-b)].

The buzz-hiss decision is made almost exclusively on the basis of the smoothness of the
pitch contour. Since the only true feature distinguishing voiced from unvoiced speech is the
presence of pitch pulses, and since the linguistic and acoustic constraints on the pitch make it
highly unlikely for a true pitch value to change dramatically in the course of a 10-msec interval,
one can expect that in voiced regions the pitch will change little from sample to sample. In
unvoiced regions, on the other hand, there is little reason to expect the algorithm to arrive at
anything other than random values for the pitch choice. The only other feature used by the
buzz-hiss decision is an extremely conservative silence threshold on the doubly downsampled
waveform s3(n).

Thus, the buzz-hiss decision operates as follows. If the energy in s3(n) is less than the
silence threshold, consider the frame hiss and set the pitch equal to 0. If none of the three
input samples to the 3-point median smoothing filter are "equal" (where equal is here defined
as within 33 Hz of each other), consider the output of the median smoother to be 0 (hiss).
Finally, if no more than 2 of the 5 ordered input samples to the 5-point median smoother are
"equal" (this time within 20 Hz of each other), consider the output of the 5-point smoother to
be 0 (hiss) (Fig. V-7).
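
The decision rules above can be sketched as a pair of gated median filters. The
interpretation of "equal" as a run of ordered values, each within the tolerance of the next,
is an assumption, as are all function names and the silence threshold argument.

```python
from statistics import median


def equal_run(values, tol):
    """Length of the longest run of ordered values, each within tol of the next."""
    ordered = sorted(values)
    best = run = 1
    for a, b in zip(ordered, ordered[1:]):
        run = run + 1 if b - a <= tol else 1
        best = max(best, run)
    return best


def gate_silence(pitch_hz, energy, threshold):
    """Force the raw pitch to 0 (hiss) when the frame energy is below the threshold."""
    return 0.0 if energy < threshold else pitch_hz


def median3(window, tol=33.0):
    """3-point median, or 0 (hiss) if no two inputs agree within tol."""
    return median(window) if equal_run(window, tol) >= 2 else 0.0


def median5(window, tol=20.0):
    """5-point median, or 0 (hiss) if no more than two inputs agree within tol."""
    return median(window) if equal_run(window, tol) >= 3 else 0.0


def smooth_contour(raw):
    """Apply the 3-point stage, then the 5-point stage, to a raw pitch contour."""
    stage1 = [median3(raw[i - 1:i + 2]) for i in range(1, len(raw) - 1)]
    return [median5(stage1[i - 2:i + 3]) for i in range(2, len(stage1) - 2)]


# Usage: one outlier in a voiced stretch is removed; erratic values become hiss.
print(smooth_contour([100, 102, 250, 104, 103, 105, 106, 104, 103]))
print(smooth_contour([90, 300, 150, 400, 60, 230, 340, 120, 500]))
```

The number sequence of Fig. V-6(b), 5, 6, 12, 7, 8, passes through `median5` with the
outlier 12 discarded and 7 returned.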


Fig. V-7. Illustration of median smoothing buzz-hiss algorithm
under artificially adverse conditions. Panels: input pitch contour; after 3-point
median smoother; after 5-point median smoother.


This algorithm works surprisingly well for determining buzz-hiss. It depends upon a
10-msec rather than a 20-msec update of the pitch. Typical buzz-hiss indicators such as zero
crossing density, R1/R0, and high-low energy ratios were avoided on purpose, because these
are likely to be degraded as a consequence of filters, distortions, and noise to which the input
speech may have been subjected.

D. LDVT IMPLEMENTATION

The algorithm as described above was incorporated into a real-time linear-prediction
vocoder implemented on the LDVT, a 55-nsec instruction cycle microcomputer, with a
standard instruction set, designed and built at Lincoln Laboratory. Memory size is a limiting
factor with the LDVT, for it has only 2000 octal program and 1000 octal data memory locations.
There is, however, a rapid access outboard memory containing 4000 octal locations from which
both programs and data can be retrieved.

The pre-emphasized analog waveform was filtered and sampled at 132-µsec intervals, and
a nonoverlapping buffer of 153 samples was accumulated for each 20-msec frame. These 153
samples were used as input both to the autocorrelator and to the first FIR filter, FIR1 (refer
to Fig. V-1). The 51 output samples of FIR1 were complex multiplied by $e^{j\omega n}$ and processed
again through the FIR filter, using certain tricks to handle the complex input data, to yield
17 new samples of s3(n). The 128-point FFT was computed twice per frame by moving along
alternately by 9, then 8, samples of s3(n). The computation of the magnitude spectrum from
the 32 most recent samples of s3(n), padded out with zeros to 128, completed the preprocessing.
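
The frame bookkeeping in this paragraph can be sketched as a sliding window that advances
alternately by 9 and 8 samples of s3(n). The generator name and its arguments are
hypothetical; the sample counts are those given in the text.

```python
from collections import deque
from itertools import cycle

FRAME_SAMPLES = 153                   # input samples per 20-msec frame
after_fir1 = FRAME_SAMPLES // 3       # 51 samples after the first 3-to-1 pass
after_fir2 = after_fir1 // 3          # 17 samples of s3(n) per frame


def analysis_windows(s3_stream, window=32, hops=(9, 8)):
    """Yield the most recent `window` samples each time the hop count is reached."""
    recent = deque(maxlen=window)
    hop_iter = cycle(hops)
    remaining = next(hop_iter)
    for sample in s3_stream:
        recent.append(sample)
        remaining -= 1
        if remaining == 0:
            if len(recent) == window:     # wait until a full window is available
                yield list(recent)
            remaining = next(hop_iter)


# Usage: four frames' worth of s3(n), numbered 0..67 for clarity.
windows = list(analysis_windows(range(4 * after_fir2)))
print([w[-1] for w in windows])
```

Alternating hops of 9 and 8 samples add up to the 17 new samples per frame, so two spectra
are produced per frame without drifting relative to the frame boundaries.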

For the postprocessing, a table of peak locations and a corresponding table of amplitudes
were determined and arranged in descending order with respect to peak size. Following this
step, the first two location entries were reordered and the difference between the two locations
was entered as the first pitch estimate. Then the third entry in the location table was inserted
in order and two new pitch estimates, defined as difference between adjacent entries, were
added, also in order, to the growing pitch-estimate table. Now the three ordered entries in
the estimate table could be scored for "equality" of adjacent elements, and an iteration is
completed. At each i-th iteration, the (i + 1)st location is inserted in order, and i new pitch
estimates are added in order to the estimate table. Processing is complete either when a peak of
insufficient amplitude is encountered, or a score of greater than seven adjacent "equal"
estimates is obtained. At this point, the mean value of the "equal" set is defined as the (unsmoothed)
pitch.

An appreciation of the complexity of the algorithm can be gained from some numbers
associated with the LDVT implementation. The total number of memory locations required for
the entire pitch algorithm was 1425 decimal, divided about 50-50 between instructions and data.
The amount of time consumed for the preprocessing (FIR filters and computation of magnitude
spectrum) was 2.66 msec per 10-msec frame, or a little over a quarter of the time available.
The time required for the postprocessing, or decision algorithm, was extremely variable and
therefore difficult to determine, but a rough calculation indicates that it was insignificant
compared with preprocessing time. For purposes of comparison, the total time requirement was
roughly twice the amount required by the LDVT implementation of the Gold-Rabiner time-domain
pitch detector.




E. RESULTS


The harmonic pitch detector, incorporated into a real-time 4000-bits/sec LPC vocoder,
was evaluated subjectively by means of an AB comparison with the Gold-Rabiner time-domain
detector,⁴ incorporated into an otherwise identical vocoder. A system was developed on the
UNIVAC 1219 facility whereby the two vocoders could be exchanged into the LDVT essentially
instantaneously, while speech subjected to various distortions and corruptions was continuously
being played. The listener could thus, because of the instantaneous juxtaposition, readily
compare the quality of the speech produced using the frequency- and time-domain pitch detectors.

Input speech subjected to typical telephone-channel degradations was generated by means
of a second LDVT containing a real-time digital telephone-channel simulator¹ (Fig. V-8). The
parameters of the simulator were controlled at the console and thus the user could conveniently
test the performance of the two pitch detectors, with increasing amounts of various corruptions.
For example, if one wished to investigate the sensitivity of the two pitch detectors to Gaussian
noise, one could set all parameters of the telephone simulator to zero except the Gaussian
noise. The noise amplitude could then be slowly increased while the two pitch detectors were
alternately loaded into the other LDVT.


Fig. V-8. Block diagram of telephone-channel simulator used to test harmonic
pitch detector performance.


Using this experimental setup, we were able to examine the relative sensitivity of the two
pitch detectors to the various distortions in the telephone lines. The major source of breakdown
in the Gold-Rabiner pitch detector is the telephone bandpass filter, which removes information
below 100 Hz, attenuates the amplitude up to as much as 1000 Hz, and changes the phase
relationship. Subjective listening tests show a substantial improvement in quality when the harmonic
pitch detector is substituted for the time-domain detector, under conditions when only the
telephone filter is present in the simulation. Figure V-9(a-b) shows an example where the
periodicity is not evident in the waveform, but is well indicated in the spectrum, when the speech is
processed through a typical telephone filter.


Fig. V-9. Waveform and spectrum of (a) telephone filtered speech,
and (b) speech corrupted with jet-engine noise.


Other audible degradations in typical telephone lines are Gaussian noise (thermal noise and
shot noise) and phase jitter. The latter is a low frequency modulation of the waveform as a
consequence of (usually 60 Hz) interference in the generation of the carrier cosine. In some
European lines, the 50-Hz jitter can be as high as 35-deg peak-to-peak amplitude, causing a
peculiar granular quality and an echo effect in the speech.

As  might  be  anticipated,  both  detectors  were  sensitive  to  Gaussian  noise,  although  the 
breakdown  as  a consequence  of  Gaussian  noise  was  not  as  great  as  might  be  expected.  The 
Gold-Rabiner  detector  was  far  more  sensitive  to  the  telephone  filter  alone  than  to  Gaussian 
noise alone, set at the level typically encountered in telephone lines (-40 dBmc). The two
detectors  were  judged  to  be  about  equally  sensitive  to  Gaussian  noise. 




Phase jitter contributes an additional degradation to the time-domain detector, particularly
at the levels encountered in European lines. Included in the harmonic pitch detector decision
algorithm is a step to suppress peaks too close to neighbors and of insufficient amplitude, which
makes the detector less sensitive to phase jitter than the time-domain detector. At typical
American line settings, phase jitter presents little problem to either detector.

The remaining parameters in the simulator, with the possible exception of harmonic
distortion, seem to have little effect on pitch extraction, at the levels commonly found in the
telephone system.

The two pitch detectors were also evaluated on certain other types of degraded speech.
Specifically, speech in the presence of (1) helicopter noise, (2) noise in a large jet airplane,
and (3) 60-Hz hum was processed through both vocoders, and the quality was compared.
Helicopter noise was found to be concentrated in frequencies above 1000 Hz, and therefore caused
only minor degradations in both detectors. Jet noise includes a large component in the
low-frequency region (below 300 Hz) and therefore interferes rather severely with the time-domain
pitch-extraction algorithm. The same is true, obviously, for 60-Hz hum, whose strongest
component is at 60 Hz but which contains weaker harmonics at higher frequencies.

For both the 60-Hz hum and the jet-engine noise, the harmonic pitch detector performed
substantially better than the time-domain detector. Even at levels of hum in which the
time-domain detector completely broke down, choosing 60 Hz as the pitch, the harmonic detector
came through with clear speech. Figure V-9(b) shows an example where the pitch information
is obscure in the waveform but evident in the spectrum, when the speech is corrupted with large
airplane jet noise.

For one specific kind of distortion in transmission channels, the harmonic detector can
actually correct the distortion and improve the quality of the original speech. This is for the
situation in which there happens to be a very large frequency offset between the transmitter and
receiver carrier in a single side-band transmission system. In such a case, both positive and
negative frequencies are shifted in toward the origin by an amount equal to the offset, such that
the original pitch harmonics are no longer harmonics. The subjective result is that the
perceived pitch is wrong, and a small amplitude background hum is heard at the correct pitch.

The harmonic pitch detector, since it does not depend upon the fundamental but only upon
spacing between harmonics, can restore the original speaker's pitch in the synthesized speech,
and remove the background hum. The formant frequencies are of course still shifted, but the
formant shift is a second-order effect, perceptually.
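
A small numerical illustration of why spacing survives a carrier offset is given below. The
120-Hz pitch and 30-Hz offset are hypothetical values chosen for the example, and the
function names are not from the report.

```python
def shifted_harmonics(f0_hz, offset_hz, count):
    """Harmonics of f0 after every component is moved toward the origin by offset_hz."""
    return [k * f0_hz - offset_hz for k in range(1, count + 1)]


def spacing(freqs):
    """Differences between adjacent frequencies."""
    return [b - a for a, b in zip(freqs, freqs[1:])]


received = shifted_harmonics(120, 30, 5)
print(received)            # components are no longer multiples of 120 Hz
print(spacing(received))   # but adjacent components are still 120 Hz apart
```

A detector that looks for the fundamental would be misled by the shifted components, whereas
one that measures the distance between adjacent peaks still recovers the original pitch.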

REFERENCES

1. S. Seneff, "A Real-Time Digital Telephone Simulation on the Lincoln
Digital Voice Terminal," Technical Note 1975-66, Lincoln Laboratory,
M.I.T. (30 December 1975), DDC AD-A021409/8.

2. P. E. Blankenship et al., "The Lincoln Digital Voice Terminal System,"
Technical Note 1975-53, Lincoln Laboratory, M.I.T. (25 August 1975),
DDC AD-A017569/5.

3. L. R. Rabiner, M. R. Sambur, and C. E. Schmidt, "Applications of a
Nonlinear Smoothing Algorithm to Speech Processing," IEEE Trans.
Acoust., Speech, and Signal Processing ASSP-23, 552 (1975).

4. B. Gold and L. R. Rabiner, "Parallel Processing Techniques for
Estimating Pitch Periods of Speech in the Time Domain," J. Acoust. Soc.
Am. 46, 442 (1969).




VI. OPTIMUM SPEECH CLASSIFICATION AND ADAPTIVE NOISE CANCELLATION

A. INTRODUCTION

There are a variety of applications in which it is necessary to be able to classify a given
set of speech data as corresponding to voiced speech, unvoiced speech, or silence. For the
synthesis of speech using Linear Predictive Coding (LPC) techniques,¹⁻⁴ for example, it is
necessary that the speech signal be classified as voiced or unvoiced. This information is
transmitted to the speech synthesizer along with coefficients that represent an all-pole linear filter
model for the vocal tract. For voiced speech, the filter is excited by a periodic train of
impulses, whereas a white-noise excitation is used when unvoiced speech is to be synthesized.

The ability to detect silence is of interest in digital communications in which channel
capacity is at a premium.⁵ By detecting intervals of silence, other data streams can be interleaved
with the speech conversation, thereby maximizing the utilization of the available bandwidth.
Another application of silence detection arises in conferencing situations.⁶ By detecting when a
set of speakers are silent, their lines can be disconnected from the superposition of inputs so
that an enhancement of synthesizer input SNR can be obtained.

Solutions  to  the  classification  problem  have,  for  the  most  part,  been  developed  on  an  ad  hoc 
basis  in  which  an  individual  discriminant  is  proposed  which  seems  to  characterize,  in  one  way 
or  another,  the  attributes  of  the  three  possible  speech  events.  In  a recent  paper,  Atal  and 
Rabiner^  proposed  an  algorithm  that  simultaneously  computes  five  of  the  most-significant  dis- 
criminants and  uses  a hypothesis  testing  strategy  to  assign  a given  set  of  observations  to  one  of 
the  three  speech  classes. 

With  few  exceptions,  most  notably  the  work  of  Atal  and  Rabiner,  most  of  the  speech  re- 
search reported  to  date  has  dealt  with  a speech  environment  that  has  been  carefully  controlled 
in  the  sense  that  background  noise  and  interference  signals  have  been  eliminated  from  the  speech. 
It  is  generally  known  that  the  intelligibility  of  modern  vocoders  is  seriously  degraded  when  noise 
and  interference  signals  are  superimposed  on  the  speech  data.^  Since  there  are  many  practical 
problems  in  which  noise  and  interference  arise,  it  is  of  interest  to  develop  more  general  speech- 
processing techniques  designed  to  eliminate  the  noise  as  much  as  possible. 

In  this  section,  it  is  assumed  that  the  speech  signals  are  corrupted  by  additive  Gaussian 
noise  that  may  or  may  not  be  white.  The  unvoiced -speech  signal  is  modeled  as  a zero-mean 
Gaussian  random  process  having  a known  covariance  function.  Voiced  speech  is  modeled  as  a 
zero-mean  Gaussian  quasi-periodic  random  process.  By  using  these  models  as  a starting  point, 
the  classification  problem  is  formulated  as  a statistical  hypothesis  test  and  is  solved  using  sta- 
tistical decision  theory.  Subject  to  the  validity  of  the  underlying  speech  models,  the  resulting 
signal-processing  algorithm  is  optimum  in  the  sense  that  the  probability  of  a decision  error  is 
minimized.  The  advantage  of  this  approach  is  that  the  discrimination  criteria  are  synthesized 
from  the  model,  rather  than  being  selected  on  an  ad  hoc  basis. 

The classification problem is recognized as a Gauss-in-Gauss detection problem for which solutions have been cataloged by Van Trees.^7 The estimator-correlator structure was chosen since it led most naturally to a practical implementation. If pitch information is available, additional discrimination can be provided in the voiced-speech channel using a comb filter tuned to the most-recent estimate of the pitch.


The ability to detect the silent intervals (noise alone) means that the statistics of the clutter can be learned and used to implement adaptive Wiener filters to enhance the speech signals prior to coding. In this mode, the adaptive prefilter can be used as a preprocessor for any narrowband or wideband speech encoder.

An extensive experimental program was developed to evaluate the classifier in a variety of acoustic-noise environments including shipboard noise, office noise, helicopter noise, and noise in an airborne command post. The results for airborne-command-post noise are included in this section.

B. MODELS FOR SILENCE, UNVOICED, AND VOICED SPEECH

The  basic  problem  of  detecting  the  presence  of  silence,  unvoiced  speech,  or  voiced  speech 
in  a given  set  of  data  can  be  formulated  as  a statistical  test  for  choosing  one  of  the  three 
hypotheses:

$$
\begin{aligned}
H_1:&\ \text{silence} & y(n) &= w(n) \\
H_2:&\ \text{unvoiced} & y(n) &= u(n) + w(n) \\
H_3:&\ \text{voiced} & y(n) &= v(n) + w(n)
\end{aligned}
\tag{VI-1}
$$

where w(n), u(n), and v(n) represent the nth sample of noise, unvoiced-speech, and voiced-speech waveforms, respectively. Based on a set of observations y(1), y(2), ..., y(N), it is desired to develop a decision rule for determining which of the three hypotheses "best" characterizes the data set. This is the classification problem. In order to synthesize an optimum decision rule in the sense that a classification is made with minimum probability of error, it is necessary to develop statistical models that characterize the data for each of the three speech events.

To begin with, the interference will be assumed to consist of simply zero-mean white Gaussian noise. Once the detector structure has been analyzed and understood for this case, the generalization to nonwhite-noise spectra follows almost by inspection.

In order to derive the structure of the classifier, it suffices to model the unvoiced- and voiced-speech waveforms as sample functions of Gaussian random processes having zero means and covariance functions $R_u(k)$ and $R_v(k)$, respectively. In addition, voiced speech is assumed to be quasi-periodic in the sense that $R_v(k + T) = R_v(k)$, where T is the period of the process. This means that almost every sample function is periodic with period T (see Ref. 8).

The preceding discussion can be summarized succinctly by the following set of modeling equations. Under hypothesis $H_i$, the observed data set is given by:

$$
y(n) = s_i(n) + w(n), \qquad i = 1, 2, 3 \tag{VI-2}
$$

where $s_1(n) = 0$ for silence, $s_2(n)$ is a Gaussian random process with mean zero and covariance $R_u(k)$ for unvoiced speech, and $s_3(n)$ is a zero-mean quasi-periodic Gaussian random process with covariance function $R_v(k)$ for voiced speech. In all cases, the noise term w(n) represents a zero-mean Gaussian white-noise random process having the correlation function $R_w(k) = \sigma^2 \delta(k)$.

C. THE OPTIMUM CLASSIFIER AGAINST WHITE NOISE

The optimum classifier processes the raw-speech data y(1), y(2), ..., y(N) in such a way that a decision is made with minimum probability of error on whether the given interval of signal should be classified as voiced speech, unvoiced speech, or silence. Using statistical decision theory, the minimum-probability-of-error decision rule is:

"Declare hypothesis $H_i$ to be true if and only if the a posteriori probability that $H_i$ is true conditioned on the observation set y(1), y(2), ..., y(N) is largest," i.e.,

$$
p[H_i \mid y(N), \ldots, y(1)] = \max_{k=1,2,3} \; p[H_k \mid y(N), \ldots, y(1)]
$$

Signal-processing configurations of the likelihood-ratio test have been documented by Van Trees.^7 For the special case of ternary hypotheses, zero means, and stationary random processes, the test is implemented by computing three sufficient statistics denoted by $\ell_i$ (i = 1, 2, 3). The first component of the ith statistic is

$$
\ell_{y_i} = \sum_{n=1}^{N} y(n)\, \hat{s}_i(n) \tag{VI-3}
$$

where $\hat{s}_i(n)$ is the linear least-squares unrealizable estimate of the ith signal $s_i(n)$. The bias component of the ith sufficient statistic is

$$
\ell_{B_i} = -\frac{T}{2} \int \ln\!\left[ 1 + \frac{G_i(f)}{N_0/2} \right] df, \qquad i = 1, 2, 3 \tag{VI-4}
$$

where T = N/F is the observation time of the process, F is the sampling rate, $G_i(f)$ is the power spectrum of the ith random process, and $N_0/2$ is the two-sided white-noise spectral density. The complete ith sufficient statistic is

$$
\ell_i = \ell_{y_i} + \ell_{B_i}, \qquad i = 1, 2, 3 \tag{VI-5}
$$

and the test consists of choosing the largest of

$$
\ell_i + \ln P_i, \qquad i = 1, 2, 3 \tag{VI-6}
$$

where $P_i$ is the a priori probability that hypothesis $H_i$ is true. The goal now is to use the gross attributes of speech signals to simplify the computations involved in implementing the likelihood-ratio test.

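Under hypothesis $H_1$, which corresponds to silence, the anticipated signal is $s_1(n) = 0$. Therefore, $\hat{s}_1(n) = 0$ whence $\ell_{y_1} = 0$, $\ell_{B_1} = 0$, and $\ell_1 + \ln P_1 = \ln P_1$. The likelihood-ratio test reduces to computing only two statistics:

$$
\ell_2 = \ell_{y_2} + \ell_{B_2} + \ln P_2 - \ln P_1 \tag{VI-7a}
$$

$$
\ell_3 = \ell_{y_3} + \ell_{B_3} + \ln P_3 - \ln P_1 \tag{VI-7b}
$$

in which only $\ell_{y_2}$ and $\ell_{y_3}$ involve the raw data, $\ell_{B_2}$ and $\ell_{B_3}$ being fixed biases reflecting the average energy in the ensembles of unvoiced- and voiced-speech sounds. Letting

$$
\lambda_u = -\ell_{B_2} - \ln P_2 + \ln P_1 \tag{VI-8a}
$$

$$
\lambda_v = -\ell_{B_3} - \ln P_3 + \ln P_1 \tag{VI-8b}
$$

$$
\lambda_{uv} = -\ell_{B_2} + \ell_{B_3} - \ln P_2 + \ln P_3 \tag{VI-8c}
$$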

the classification rule reduces to the following:

If $\ell_{y_2} \le \lambda_u$ and $\ell_{y_3} \le \lambda_v$, declare silence. (VI-9a)

If $\ell_{y_2} > \lambda_u$ or $\ell_{y_3} > \lambda_v$, and $\ell_{y_2} - \ell_{y_3} > \lambda_{uv}$, declare unvoiced speech. (VI-9b)

If $\ell_{y_2} > \lambda_u$ or $\ell_{y_3} > \lambda_v$, and $\ell_{y_2} - \ell_{y_3} < \lambda_{uv}$, declare voiced speech. (VI-9c)
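The three-way rule of Eq. (VI-9) can be sketched in a few lines of Python. This is only an illustration: the function and argument names are hypothetical, and the tie case (difference exactly equal to the voicing threshold), which Eq. (VI-9) leaves unspecified, is assigned to voiced speech here as an assumption.

```python
def classify_frame(l_y2, l_y3, lam_u, lam_v, lam_uv):
    """Apply the rule of Eq. (VI-9) to one frame.

    l_y2, l_y3 : correlation statistics of the unvoiced and voiced channels
    lam_u, lam_v, lam_uv : thresholds of Eq. (VI-8)
    """
    # (VI-9a): neither channel exceeds its threshold, so only noise is present.
    if l_y2 <= lam_u and l_y3 <= lam_v:
        return "silence"
    # (VI-9b): speech is present and the unvoiced channel dominates.
    if l_y2 - l_y3 > lam_uv:
        return "unvoiced"
    # (VI-9c): speech is present and the voiced channel dominates.
    # Assumption: an exact tie is also labeled voiced.
    return "voiced"
```

With all thresholds set to 1.0, 1.0, and 0.0, for example, a frame with statistics (0.2, 0.5) is labeled silence, while (3.0, 0.5) is labeled unvoiced.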

In order to simplify the test further, it is noted from Eq. (VI-4) that the bias terms $\ell_{B_2}$ and $\ell_{B_3}$ are related to the energy in the ensemble of unvoiced- and voiced-speech sample functions. If a global average is taken, the voiced-speech spectrum will have significantly more energy than that of unvoiced speech, which would contribute a negative bias in favor of the unvoiced-speech hypothesis. Using this bias would be valid if voiced speech were truly stationary. In fact, however, not only do the spectral properties change from frame to frame, but more importantly the amplitude undergoes a slowly increasing and decreasing modulation at the beginning and ending of a voiced sound. Since 10- to 20-msec frames of speech represent the data base upon which a classification is to be made, then from a sample function point of view the energy in a frame of unvoiced speech or a frame of voiced speech could be comparable. The inclusion of the ensemble average energy bias term would therefore incorrectly favor unvoiced speech. Therefore, the bias terms $\ell_{B_2}$ and $\ell_{B_3}$ must be assumed to be equal. Under this condition, the thresholds reduce to

$$
\lambda_u = -\ell_B - \ln P_2 + \ln P_1 \tag{VI-10a}
$$

$$
\lambda_v = -\ell_B - \ln P_3 + \ln P_1 \tag{VI-10b}
$$

$$
\lambda_{uv} = -\ln P_2 + \ln P_3 \tag{VI-10c}
$$

where $\ell_B = \ell_{B_2} = \ell_{B_3}$ represents an unknown bias term related to the a priori knowledge of the energy in the unvoiced- and voiced-speech signals.

Although the Bayesian detection theory demands that the bias term and a priori probabilities be calculated, a more practical method for determining the thresholds would be to train the system against noise and then choose those values that keep the false-alarm rate at a value consistent with the system objectives. For example, a much greater penalty is paid for failing to detect speech than for falsely classifying noise as speech. Therefore, the thresholds $\lambda_u$ and $\lambda_v$ most likely should be set close to the 1-sigma values of $\ell_{y_2}$ and $\ell_{y_3}$ obtained during the noise training phase. This strategy is ideal for self-adaptive tracking of the noise statistics should they be nonstationary. The voicing threshold $\lambda_{uv}$ is most reasonably approximated by zero when the SNR is large or the noise is white. When this is not the case, this threshold can also be trained to the 1-sigma value of $\ell_{y_2} - \ell_{y_3}$.
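The noise-training idea can be illustrated with a short sketch. It assumes that "1-sigma value" means the mean of the statistic over noise-only frames plus one standard deviation; the text does not define the term more precisely, so that reading, the function name, and the `num_sigma` parameter are all assumptions made for illustration.

```python
import statistics


def train_thresholds(noise_l_y2, noise_l_y3, num_sigma=1.0):
    """Estimate thresholds from statistics measured on noise-only frames.

    noise_l_y2, noise_l_y3 : per-frame correlation statistics of the
        unvoiced and voiced channels, collected while only noise is present.
    Returns (lam_u, lam_v, lam_uv).
    Assumption: each threshold is mean + num_sigma * standard deviation.
    """
    def level(values):
        return statistics.mean(values) + num_sigma * statistics.pstdev(values)

    # The voicing threshold is trained on the channel difference.
    differences = [a - b for a, b in zip(noise_l_y2, noise_l_y3)]
    return level(noise_l_y2), level(noise_l_y3), level(differences)
```

Re-running the function on a sliding window of recent frames already classified as silence would be one simple way to track nonstationary noise, although the text does not prescribe a particular update schedule.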

As a result of the preceding analysis, the only statistics that must be calculated at each frame time are the correlations

$$
\ell_{y_i} = \sum_{n=1}^{N} y(n)\, \hat{s}_i(n), \qquad i = 2, 3 \tag{VI-11}
$$

where y(n) is the raw-speech-plus-noise data, and $\hat{s}_i(n)$ is the linear least-squares unrealizable estimate of $s_i(n)$ given that hypothesis $H_i$ is true. Since the unvoiced- and voiced-speech waveforms are quasi-stationary, the filter that results in $\hat{s}_i(n)$ given that $y(n) = s_i(n) + w(n)$ has the transfer function

$$
H_i(f) = \frac{G_i(f)}{G_i(f) + N_0/2} \tag{VI-12}
$$

The filters defined by Eq. (VI-12) obtain enhanced discrimination against noise by passing only those frequencies where the signal power is substantially larger than the noise power. Enhanced voiced-unvoiced discrimination depends on the implicit orthogonality of the two random processes, as reflected by the degree to which the spectral densities are correlated. Both these detection statistics can be improved by capitalizing on the quasi-periodic properties of voiced speech. If the voiced-speech process is periodic with period T, then the voiced-speech power spectrum is more accurately represented by

$$
G_3(f) = C(f;T)\, G_v(f) \tag{VI-13}
$$

where $G_v(f)$ represents the gross properties of the spectral envelope, and $C(f;T)$ is a comb filter reflecting the fine structure of the periodic spectrum. If the period is maintained for M periods, then

$$
C(f;T) = \frac{\sin(\pi M f / F_0)}{M \sin(\pi f / F_0)} \cdot \exp[\, j\pi (M - 1) f / F_0 \,] \tag{VI-14}
$$

where $F_0 = 1/T$ represents the pitch frequency. Not only does the comb filter enhance the voiced-speech-to-noise ratio, but it also increases the orthogonality of the voiced and unvoiced spectra. In order to exploit the additional discrimination implicit in the comb filter, it is necessary that the pitch period be known. A discussion of how the pitch is to be determined will be deferred to Sec. E below.

Subject to the assumptions that the envelopes of the unvoiced- and voiced-speech power spectra are known and that the pitch period for voiced speech can be estimated, the optimum classifier can be implemented as shown in Fig. VI-1.

Fig. VI-1. Optimum speech classifier against white noise.


Of course, all this information is not available a priori and it will be necessary to introduce approximations to the filtering and estimation operations while maintaining the basic structure of the estimator-correlator receiver. This will be the goal of the next section.

D. PRACTICAL IMPLEMENTATION OF THE ESTIMATOR-CORRELATOR SPEECH CLASSIFIER

For voiced speech, the optimum minimum mean-squared error filter has the transfer function

$$
H_3(f) = \frac{G_v(f)\, C(f;T)}{G_v(f)\, C(f;T) + N_0/2} \tag{VI-15}
$$

which passes those frequencies at which the signal power is substantially larger than the noise power and rejects all others. Certainly, the comb filter in the denominator contributes to the definition of those frequencies at which noise rejection should occur. However, in white noise approximately the same rejection performance can be obtained by a cascade combination of the comb filter and the least-squares filter designed on the basis of simply the spectral envelope. Therefore, the voiced-speech estimator filter is taken to be

$$
H_v(f) = C(f;T) \cdot \frac{G_v(f)}{G_v(f) + N_0/2} \tag{VI-16}
$$

For unvoiced speech, the estimator filter is

$$
H_u(f) = \frac{G_u(f)}{G_u(f) + N_0/2} \tag{VI-17}
$$

Setting i = 2 for unvoiced and i = 3 for voiced, the Wiener filters based on the spectral envelopes for both cases can be written as

$$
H_i(z) = \sum_{k=-\infty}^{\infty} a_k z^{-k} \tag{VI-18}
$$

where the coefficients $a_k$ satisfy the Wiener-Hopf equation

$$
\sum_{k=-\infty}^{\infty} a_k \left[ R_i(k - j) + \sigma^2 \delta(k - j) \right] = R_i(j), \qquad -\infty < j < \infty \tag{VI-19}
$$

where $\sigma^2 = (N_0/2) F_s$ represents the energy in the noise process ($F_s$ is the sampling rate), and where $R_2(k)$, $R_3(k)$ are the sampled-data correlation functions corresponding to the power spectra $G_u(f)$, $G_v(f)$, respectively. In practice, the correlation functions can be suitably truncated and Eq. (VI-19) can be efficiently solved using the Levinson recursion.^ Of course, the solution requires that the correlation functions for an ensemble of unvoiced- and voiced-speech sample functions be computed for a large class of utterances and a large class of speakers. In order to bootstrap the system, initial classification would have to be done manually, which would be extremely tedious and time consuming. In order to avoid this problem, a more practical and robust strategy is proposed based on the well-known global properties of unvoiced- and voiced-speech spectra and a close examination of the filtering operation defined in Eqs. (VI-16) and (VI-17).
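As an illustration of the truncated Wiener-Hopf solution mentioned above, the following sketch solves a symmetric Toeplitz system with a Levinson-type recursion and uses it to obtain 2K + 1 two-sided filter taps. The function names, the truncation length `half_len`, and the example correlation values are assumptions introduced here; the report does not give its own implementation.

```python
def solve_symmetric_toeplitz(first_col, rhs):
    """Solve T x = rhs, where T is symmetric Toeplitz with the given first
    column, by a Levinson-type recursion (O(n^2) operations)."""
    n = len(rhs)
    forward = [1.0 / first_col[0]]        # solves T_k f = e_1
    solution = [rhs[0] / first_col[0]]    # solves T_k x = rhs[:k]
    for k in range(1, n):
        # Error made when the forward vector is extended by a zero.
        err_f = sum(first_col[k - i] * forward[i] for i in range(k))
        denom = 1.0 - err_f * err_f
        fwd_ext = forward + [0.0]
        bwd_ext = [0.0] + forward[::-1]   # backward vector, by symmetry
        forward = [(fwd_ext[i] - err_f * bwd_ext[i]) / denom
                   for i in range(k + 1)]
        backward = forward[::-1]          # solves T_{k+1} b = e_last
        # Correct the extended solution so the new last equation holds.
        err_x = sum(first_col[k - i] * solution[i] for i in range(k))
        scale = rhs[k] - err_x
        solution = [x + scale * b
                    for x, b in zip(solution + [0.0], backward)]
    return solution


def wiener_taps(corr, noise_var, half_len):
    """Truncated form of Eq. (VI-19): taps a_k for k = -half_len..half_len.

    corr[m] is the signal correlation R_i(m) for m = 0..2*half_len and
    noise_var is the white-noise variance sigma^2.
    """
    size = 2 * half_len + 1
    first_col = [corr[m] + (noise_var if m == 0 else 0.0)
                 for m in range(size)]
    rhs = [corr[abs(j)] for j in range(-half_len, half_len + 1)]
    return solve_symmetric_toeplitz(first_col, rhs)
```

When the noise variance is zero the recursion returns a unit impulse (the data are passed unchanged), and the taps shrink as the noise variance grows, which is the expected Wiener behavior.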




The essence of the Wiener filter is to pass those frequencies at which the speech power is substantially larger than the noise power. As a good first approximation, it seems reasonable to approximate the Wiener filter by a passband filter that passes "most" of the energy in an unvoiced- or voiced-speech sound. For unvoiced speech, it can be assumed that "most" of the energy will be above 1000 Hz, while for voiced speech "most" of the energy will be below 2000 Hz. While restricting the estimator filters to these frequencies improves the detection SNR of unvoiced and voiced speech, of at least equal importance is the ability of the unvoiced filter to reject voiced speech and vice versa. Since the first formant of voiced speech is approximately 1000 Hz, then, if the cutoff of the unvoiced-speech filter is above 1250 Hz, most of the unvoiced-speech energy will pass through the filter while a large fraction of a voiced-speech signal will be attenuated. Similarly, if the cutoff of the voiced-speech filter is above 2000 Hz, then most of the voiced-speech energy will pass through the voiced filter, while a substantial fraction of an unvoiced-speech signal will be attenuated. From this point of view it can be seen that it is crucial that the input data to the classifier not be pre-emphasized since the higher formants of a voiced-speech signal would take on the attributes of an unvoiced-speech waveform at the expense of good classifier performance. Therefore, if pre-emphasis is to be used for speech analysis and synthesis, the data will have to undergo digital de-emphasis prior to speech classification.

On the basis of the preceding arguments, the Wiener filter for unvoiced speech will be approximated by a high-pass linear-phase digital filter whose cutoff frequency is below 1250 Hz. For voiced speech, a low-pass linear-phase digital filter having a cutoff frequency above 2000 Hz will be used. The linear-phase requirement is essential since the temporal properties of the waveforms must be preserved in order that a meaningful correlation operation be obtained. The practical implementation of the optimum classifier against white noise is shown in Fig. VI-2. The detailed characteristics of the linear-phase filters are provided in the Conclusions of this section on p. 73.
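A minimal sketch of the two fixed channels follows, assuming windowed-sinc FIR designs with an odd number of symmetric taps (hence linear phase). The sampling rate of 8000 Hz, the 31-tap length, the Hamming window, and the cutoffs of 2100 Hz and 1200 Hz used in the usage note are illustrative assumptions only; the actual filter characteristics are those given in the Conclusions of this section.

```python
import math


def lowpass_taps(cutoff_hz, fs_hz, num_taps):
    """Hamming-windowed sinc low-pass; num_taps must be odd so that the
    taps are symmetric about the center (linear phase)."""
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)                      # normalize to unity gain at DC
    return [t / gain for t in taps]


def highpass_taps(cutoff_hz, fs_hz, num_taps):
    """High-pass obtained as (unit impulse) - (low-pass), still symmetric."""
    taps = [-t for t in lowpass_taps(cutoff_hz, fs_hz, num_taps)]
    taps[(num_taps - 1) // 2] += 1.0
    return taps


def filter_aligned(y, taps):
    """Filter y and remove the constant group delay so that the output is
    time-aligned with the input; samples outside the frame count as zero."""
    mid = (len(taps) - 1) // 2
    out = []
    for n in range(len(y)):
        acc = 0.0
        for j, h in enumerate(taps):
            idx = n + mid - j
            if 0 <= idx < len(y):
                acc += h * y[idx]
        out.append(acc)
    return out


def correlation_statistic(y, taps):
    """Correlate the data with its filtered estimate, as in Eq. (VI-11)."""
    s_hat = filter_aligned(y, taps)
    return sum(a * b for a, b in zip(y, s_hat))
```

For a 20-msec frame, the voiced-channel statistic would be `correlation_statistic(frame, lowpass_taps(2100, 8000, 31))` and the unvoiced-channel statistic `correlation_statistic(frame, highpass_taps(1200, 8000, 31))`. Removing the group delay before correlating is what makes the linear-phase property useful here.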

Implicit in the realization illustrated in Fig. VI-2 is the estimation of the pitch period of a voiced waveform so that the additional discrimination inherent in the comb filter can be exploited. A further simplification in processor complexity can be obtained simply by omitting the comb filter and relying on the spectral orthogonality of the two speech types. However, since the periodicity of the voiced-speech process is a potentially powerful classification discriminant, for theoretical completeness, it is worthwhile to develop a practical algorithm to exploit it. Since this necessitates an estimate of the pitch period, a brief exposition of an optimum pitch-estimation algorithm will be presented.

Fig. VI-2. Practical realization of optimum speech classifier.

E. PITCH ESTIMATION

Voiced speech was modeled as a periodic random process in the sense that $R_v(k) = R_v(k + T)$ for some pitch period T. This means that almost every sample function in the ensemble is periodic with period T. Therefore, the voiced speech signal v(n) can be modeled as

$$
v(n) = q(n \bmod T) \tag{VI-20}
$$

where q(1), q(2), ..., q(T) are completely unknown. Of course, to be faithful to the random process formulation of voiced speech, the quantities q(k) should be treated as correlated random variables. However, to keep the estimation problem mathematically tractable, the correlation properties will be ignored at first. The voiced-speech data are therefore taken to be

$$
y(n) = v(n) + w(n) \tag{VI-21}
$$

where w(n) represents white Gaussian noise, and v(n) is given by Eq. (VI-20). Based on N samples of these data, the parameters q(1), q(2), ..., q(T) and T are to be estimated.

The above pitch-estimation problem was formulated and solved by Wise, Caprio, and Parks.^10 Using the maximum-likelihood estimation rule they minimized the cost function

$$
\begin{aligned}
D(q, T) &= \sum_{n=1}^{N} [y(n) - v(n)]^2 \\
&= \sum_{n=1}^{N} y^2(n) - 2 \sum_{k=1}^{T} \sum_{m=0}^{M-1} y(k + mT)\, v(k + mT) + \sum_{k=1}^{T} \sum_{m=0}^{M-1} v^2(k + mT)
\end{aligned}
\tag{VI-22}
$$

In order to simplify the derivation, it has been assumed that N = MT, M an integer.† From the periodicity condition $v(k + mT) = q(k)$, Eq. (VI-22) reduces to

$$
D(q, T) = \sum_{n=1}^{N} y^2(n) - 2 \sum_{k=1}^{T} q(k) \sum_{m=0}^{M-1} y(k + mT) + M \sum_{k=1}^{T} q^2(k) . \tag{VI-23}
$$

† The more general case is tedious, and contributes little to the final result.




Since the basic voiced-speech waveform q(1), ..., q(T) has been assumed completely unknown (i.e., the correlation properties have been ignored†), then, for fixed T, the minimizing values are obviously

$$
\hat{q}(k) = \frac{1}{M} \sum_{m=0}^{M-1} y(k + mT) . \tag{VI-24}
$$

The estimate of the voiced-speech waveform is therefore

$$
\hat{v}(n \mid N) = \hat{q}(n \bmod T) \tag{VI-25}
$$

where the notation $\hat{v}(n \mid N)$ is used to denote the fact that all N measurements y(1), y(2), ..., y(N) are used in developing the estimate of the voiced-speech waveform v(n), n < N. In that sense, the estimator is unrealizable.‡ The corresponding minimum value of the likelihood function is

$$
D(T) = \sum_{n=1}^{N} [y(n) - \hat{v}(n \mid N)]^2 \tag{VI-26a}
$$

$$
\phantom{D(T)} = \sum_{n=1}^{N} y^2(n) - \sum_{n=1}^{N} \hat{v}^2(n \mid N) . \tag{VI-26b}
$$

Since $\hat{v}(n \mid N)$ can be interpreted as the output of a comb filter tuned to pitch period T when y(n) is the input, then the second term in Eq. (VI-26b) simply represents the energy at the output of this comb filter. Therefore, the optimum estimate of the pitch period can be obtained by constructing a bank of comb filters each tuned to a slightly different pitch period, and choosing as the estimate the pitch corresponding to the comb filter for which the output energy is largest.
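The comb-filter bank can be sketched as follows. The derivation above assumes N = MT for every candidate, which cannot hold for all periods at once with a fixed buffer, so this sketch makes two assumptions of its own: each candidate period T uses the first M·T samples with the same M, and the output energy is divided by the energy of that segment so that candidates of different length are comparable. All names are hypothetical.

```python
def comb_average(y, period, num_periods):
    """Eq. (VI-24): average num_periods consecutive periods of y."""
    return [sum(y[k + m * period] for m in range(num_periods)) / num_periods
            for k in range(period)]


def comb_energy_fraction(y, period, num_periods):
    """Comb-filter output energy (second term of Eq. (VI-26b)) divided by
    the input energy of the same segment (normalization is an assumption)."""
    span = period * num_periods
    q_hat = comb_average(y, period, num_periods)
    output_energy = num_periods * sum(q * q for q in q_hat)
    input_energy = sum(v * v for v in y[:span])
    return output_energy / input_energy if input_energy > 0.0 else 0.0


def estimate_pitch_period(y, min_period, max_period, num_periods):
    """Return the candidate period (in samples) whose comb filter captures
    the largest fraction of the input energy; ties go to the shorter one."""
    best_period, best_score = None, -1.0
    for period in range(min_period, max_period + 1):
        if period * num_periods > len(y):
            break                         # not enough data for this candidate
        score = comb_energy_fraction(y, period, num_periods)
        if score > best_score:
            best_period, best_score = period, score
    return best_period
```

Because a waveform with period T is also periodic with period 2T, the search range should normally exclude multiples of the expected period, or a shortest-period preference should be applied, to avoid pitch-doubling errors.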

It  is  important  to  keep  in  mind  the  fact  that  voiced-speech  signals  are  at  best  quasi -periodic; 
hence,  there  is  a definite  limitation  on  the  number  of  periods  over  which  the  averaging  process 
is  a meaningful  operation.  Since  values  of  the  pitch  frequency  generally  fall  within  the  range 
70  to  300  Hz  corresponding  to  pitch  periods  3 to  15  msec  long,  and  since  the  time  required  for 
a significant  alteration  in  the  vocal  tract  is  approximately  20  msec,  there  can  be  1 to  7 repetitions 
of  the  voiced-speech  waveform.  Therefore,  the  number  of  periods  over  which  the  data  are  av- 
eraged is  a design  parameter  that  must  be  chosen  to  carefully  trade  off  the  estimation  accuracy 
and  the  quasi-periodic  nature  of  the  voiced-speech  waveform. 

A particularly important practical case corresponds to the assumption that the voiced-speech waveform is periodic for two successive periods. In this case, from Eqs. (VI-24) and (VI-25) the maximum-likelihood estimate of the voiced-speech signal is

$$
\hat{v}(n \mid N) = \tfrac{1}{2} [y(n) + y(n - T)] \tag{VI-27}
$$

† The more general case is treated by McAulay.^11

‡ A realizable estimator that uses only the data up to time n is

$$
\hat{v}(n \mid n) = \frac{1}{M} \sum_{m=0}^{M-1} y(n - mT)
$$



which from Eq. (VI-26) results in the residual error

$$
D(T) = \sum_{n=1}^{N} [y(n) - \hat{v}(n \mid N)]^2 = \frac{1}{4} \sum_{n=1}^{N} [y(n) - y(n - T)]^2 . \tag{VI-28}
$$

The estimate of the pitch period is then the value of T that minimizes D(T). This criterion has already been proposed for pitch estimation by Moorer^ and by Ross^ except that the squared difference has been approximated by the absolute magnitude difference function in order to achieve greater dynamic range and computational speed. Experimental results have shown that the quality of the pitch estimates is roughly equivalent to that of the cepstral method, and successful operation has also been demonstrated in strong noise environments. For this reason, it is conjectured that Eqs. (VI-24) through (VI-26) represent a possible solution to the problem of robust pitch estimation. To see this, suppose that the true pitch period is $T_0$; then, the observed data are

$$
y(n) = v(n; T_0) + w(n) \tag{VI-29}
$$

where $v(n; T_0) = q(n \bmod T_0)$. The output of the comb filter tuned to pitch period T is

$$
\hat{v}(n; T) = \frac{1}{M} \sum_{m=0}^{M-1} v(n - mT; T_0) + \frac{1}{M} \sum_{m=0}^{M-1} w(n - mT) . \tag{VI-30}
$$

The noise signal at the output of the comb filter is

$$
\eta(n; T) = \frac{1}{M} \sum_{m=0}^{M-1} w(n - mT) . \tag{VI-31}
$$

As long as the correlation time of the noise process is less than the minimum pitch period of interest, then, if w(n) has variance $\sigma^2$, $\eta(n; T)$ will have variance $\sigma^2 / M$. For the comb filter tuned to pitch $T_0$, the output signal is

$$
\hat{v}(n; T_0) = v(n; T_0) + \eta(n; T_0) . \tag{VI-32}
$$

Therefore, there is an M:1 increase in SNR as a result of using the comb filter. Applied to the two-pulse canceler in Eq. (VI-28) (i.e., the AMDF), a 3-dB improvement in SNR is obtained for the class of noise processes whose correlation times are less than the minimum pitch period of interest.
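The variance reduction quoted above follows in one step from Eq. (VI-31), under the stated assumption that noise samples spaced by at least one pitch period are uncorrelated, so that only the M terms with m = m' survive in the double sum:

```latex
\begin{aligned}
\operatorname{Var}[\eta(n;T)]
  &= \frac{1}{M^2} \sum_{m=0}^{M-1} \sum_{m'=0}^{M-1}
     E\!\left[ w(n - mT)\, w(n - m'T) \right]
   = \frac{1}{M^2} \cdot M \sigma^2
   = \frac{\sigma^2}{M}, \\[4pt]
\text{SNR gain}
  &= 10 \log_{10} M \ \text{dB}
   \quad\Longrightarrow\quad
   10 \log_{10} 2 \approx 3.0 \ \text{dB for } M = 2 .
\end{aligned}
```

The signal term is unchanged by the averaging when $T = T_0$, which is why the full factor of M appears as an SNR gain.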

Although originally proposed as a pitch estimation criterion based on ad hoc considerations, the maximum-likelihood theory shows that the average squared difference function is optimum and robust when the voiced-speech waveform is modeled as a deterministic quasi-periodic waveform with periodicity extending over two periods. The major limitation in using the two-pulse comb filter (i.e., the AMDF) is the not-infrequent occurrence of pitch doubling which occurs when the voiced speech is periodic for at least four pitch periods. At the expense of increasing the length of the speech buffer, an M-pulse comb filter, M ≥ 3, can be used to reduce the rate at which pitch doubling errors occur.
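The two-pulse criterion can be sketched as a lag search over the squared (or absolute) difference function. Dividing by the number of terms, so that different lags are compared on the same footing, and the choice of search limits are assumptions of this sketch rather than details taken from the text; the constant factor of Eq. (VI-28) is omitted because it does not affect the minimizing lag.

```python
def difference_function(y, lag, use_magnitude=False):
    """Mean squared difference between y(n) and y(n - lag), or the mean
    absolute difference (AMDF form) when use_magnitude is True."""
    terms = len(y) - lag
    if use_magnitude:
        total = sum(abs(y[n] - y[n - lag]) for n in range(lag, len(y)))
    else:
        total = sum((y[n] - y[n - lag]) ** 2 for n in range(lag, len(y)))
    return total / terms


def estimate_pitch_lag(y, min_lag, max_lag, use_magnitude=False):
    """Return the lag in [min_lag, max_lag] that minimizes the difference
    function; ties are resolved in favor of the shorter lag."""
    best_lag, best_value = None, float("inf")
    for lag in range(min_lag, min(max_lag, len(y) - 1) + 1):
        value = difference_function(y, lag, use_magnitude)
        if value < best_value:
            best_lag, best_value = lag, value
    return best_lag
```

For a perfectly periodic input the difference function is also zero at twice the true lag, which is the pitch-doubling ambiguity described above; keeping the shorter lag on ties only helps in the noise-free case.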


Fig. VI-3. Practical implementation of optimum pitch estimator.


A further enhancement in the pitch estimate can be obtained by using the low-pass voiced-speech filter to increase the pitch estimator SNR. This corresponds to exploitation of the global correlation properties of voiced speech. The approximate matched filter configuration of the pitch detector is shown in Fig. VI-3.

F. THE OPTIMUM CLASSIFIER AGAINST COLORED NOISE

There are several examples in which speech in nonwhite acoustic background noise can be effectively classified using the algorithm that was defined to be optimum against white noise. In particular, whenever the SNR is high, the white-noise classifier will yield acceptable performance. There are some cases, particularly if the SNR is low and the noise is highly correlated, where significant improvements can be achieved by taking the spectral characteristics of the noise into account. In this section, the structure of the optimum classifier will be derived for the colored-noise case and then reasonable practical approximations will be deduced in order to simplify the complexity of the signal processor.

For this classification problem, the data corresponding to hypothesis $H_i$ are

$$
y(n) = s_i(n) + w_c(n) + w(n), \qquad i = 1, 2, 3 \tag{VI-33}
$$

where $w_c(n)$ denotes the colored noise present on all three hypotheses. Note that a white-noise component w(n) is also incorporated into the model to avoid mathematical problems relating to singular solutions. The standard approach to this problem is to precede all the processing by a whitening filter and then apply the white-noise solution. This was the approach taken by McAulay,^ which, although mathematically correct, encounters practical difficulties because the whitening filter essentially pre-emphasizes the speech data. As has already been discussed, this can cause the higher formants of voiced speech to acquire the same attributes as unvoiced speech, which makes classification difficult. McAulay and Yates^ derived an estimator-correlator classifier that does not require a whitening prefilter. Drawing on their results and those developed in Sec. C above, two sufficient statistics are computed, namely:

$$
\ell_{z_i} = \sum_{n=1}^{N} z(n)\, \hat{s}_i(n), \qquad i = 2, 3 \tag{VI-34}
$$


where

$$
\hat{s}_i(n) = \sum_{k=-\infty}^{\infty} h_i(n - k)\, y(k), \qquad i = 2, 3 \tag{VI-35}
$$

is the linear least-squared error unrealizable estimate of $s_i(n)$ based on the data $y(n) = s_i(n) + w_c(n) + w(n)$, and where

$$
z(n) = \sum_{k=-\infty}^{\infty} h_c(n - k)\, y(k) \tag{VI-36}
$$

is the result of passing y(n) through the clutter-rejection filter $h_c(n)$. It has been implicitly assumed that the speech and noise processes are independent and quasi-stationary. The transfer functions of the filters are^

$$
H_i(f) = \frac{G_i(f)}{G_i(f) + G_c(f) + N_0/2}, \qquad i = 2, 3 \tag{VI-37}
$$

$$
H_c(f) = 1 - \frac{G_c(f)}{G_c(f) + N_0/2} = \frac{N_0/2}{G_c(f) + N_0/2} \tag{VI-38}
$$

(VI-38) 


where $G_c(f)$, $G_2(f)$, and $G_3(f)$ represent the power spectra for the colored-noise, unvoiced-speech, and voiced-speech processes, respectively. The second term in Eq. (VI-38) is precisely the linear least-squares unrealizable estimator of $w_c(n)$ based on the signal $w_c(n) + w(n)$. Therefore, the clutter filter attempts to remove the colored noise from the data before performing the correlation operation. The optimum classifier structure is shown in Fig. VI-4. The classification rule is similar to that derived for white noise, Eq. (VI-9), except that the sufficient statistics are now $\ell_{z_2}$ and $\ell_{z_3}$ instead of $\ell_{y_2}$ and $\ell_{y_3}$.

Fig. VI-4. Optimum speech classifier against colored noise.


G.  PRACTICAL  IMPLEMENTATION  OF  THE  ESTIMATOR-CORRELATOR 
SPEECH  CLASSIFIER 

The arguments for simplifying the processing of voiced and unvoiced speech proceed along
the same lines as those made for the white-noise case. In particular, if knowledge of pitch is
available, the spectral harmonics of voiced speech are matched by using a comb filter in cascade
with the Wiener filter designed on the basis of the spectral envelope. Therefore, the
voiced-speech estimator filter is

$$H_v(f) = C(f;T)\,\frac{G_v(f)}{G_v(f) + G_c(f)} \tag{VI-39}$$


where $C(f;T)$ is the comb filter tuned to the most-recent pitch estimate $T$. For unvoiced speech,
the estimator filter is†

$$H_u(f) = \frac{G_u(f)}{G_u(f) + G_c(f)} \tag{VI-40}$$


Lacking knowledge of the exact form of $G_u(f)$ and $G_v(f)$, a good first approximation is to use the
linear-phase low-pass (cutoff above 2000 Hz) and high-pass (cutoff below 1250 Hz) filters in the
voiced- and unvoiced-speech channels as was done in the white-noise case. This ensures the
spectral orthogonality of the two speech channels and enhances the speech-to-noise ratio whenever
the noise spectrum lies outside the filter passbands. For colored noise, however, it is
possible that all the noise energy will lie within the filter passbands, in which case no speech
enhancement will occur if only the fixed filters are used. Somehow, additional processing tuned
to reject the clutter will have to precede the fixed filters in the speech channels. To develop a
clue as to the form of the clutter processor, it is necessary to re-examine Eqs. (VI-39) and
(VI-40). Letting $G_2(f) = G_u(f)$ and $G_3(f) = G_v(f)$, then the unvoiced- and voiced-speech Wiener
filters can be written as

$$H_i(f) = \frac{G_i(f)}{G_i(f) + G_c(f)} \tag{VI-41}$$


Realization of these filters requires that the speech and noise spectra be known. Since the noise
statistics can be measured during the silent intervals, it is reasonable to assume that the clutter
spectrum is known. Unfortunately, a priori estimates of the speech spectra are not available
unless long-term averages are determined from training sets. When detailed knowledge of the
frequency distribution of the speech is unavailable, a conservative approach is to model the
speech as white noise, thereby having a flat spectrum. Letting

$$G_i(f) = \alpha_i, \qquad i = 2, 3 \tag{VI-42}$$

and substituting this into Eq. (VI-41) results in the filters

$$H_i(f) = \frac{\alpha_i}{G_c(f) + \alpha_i}, \qquad i = 2, 3 \tag{VI-43}$$


Since $H_i(f) \approx 0$ whenever $G_c(f) \gg \alpha_i$ and $H_i(f) \approx 1$ whenever $G_c(f) \ll \alpha_i$, Eq. (VI-43) can be
interpreted as a notch filter tuned to reject "most" of the clutter energy. When the speech-to-noise
ratio is large, little clutter rejection is needed and $\alpha_i$ should be large, since this results in a
passband filter. When the speech-to-noise ratio is small, then the clutter must be rejected
whatever the cost in speech distortion, which necessitates a small value for $\alpha_i$. It follows,
therefore, that the parameter $\alpha_i$ should be proportional to the speech-to-noise ratio. Since the
clutter power is known from the silent intervals, estimates of the speech-to-noise ratio can be
made from the data frame being analyzed. In this mode, the distinction between voiced and
unvoiced speech disappears and only a single parameter value and clutter filter need be determined.
In this sense, the clutter filter represents an adaptive prefilter whose output, in a conservative
sense, represents the best available estimate of the speech waveform.

† The effects of the artificial white-noise term have been neglected at this point, since there is
no problem with singular solutions.

The results of this discussion are summarized in Fig. VI-5, which shows the practical
realization of the optimum classifier operating against a colored-noise background. Except for the
clutter filters in the reference and speech channels, the processing is identical to that used in
the white-noise case. Since selection of the tuning parameters $\alpha_c$ and $\alpha_s$ depends on the noise
statistics, further discussion regarding their selection will be deferred to Sec. H below.


Fig. VI-5. Practical realization of optimum speech classifier
against colored noise.
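
The notch-filter reading of Eq. (VI-43) can be checked numerically. The following is a minimal sketch, not the report's implementation: the function name and the five-bin clutter spectrum are hypothetical, chosen only to show how the gain alpha / (G_c(f) + alpha) behaves when alpha is large or small compared with the clutter peak.

```python
def clutter_notch_gain(clutter_psd, alpha):
    """Gain alpha / (G_c(f) + alpha) of Eq. (VI-43) at each frequency bin."""
    return [alpha / (g + alpha) for g in clutter_psd]


# Hypothetical clutter spectrum: a low floor with one strong narrowband component.
clutter_psd = [0.01, 0.01, 100.0, 0.01, 0.01]

# alpha much larger than the clutter peak: the filter is close to all-pass.
high_snr_gain = clutter_notch_gain(clutter_psd, alpha=1000.0)

# alpha much smaller than the clutter peak: a deep notch forms at that bin only.
low_snr_gain = clutter_notch_gain(clutter_psd, alpha=1.0)
```

With the small value of alpha the gain at the clutter peak falls to about 0.01 while the other bins stay near unity, which is the trade between clutter rejection and speech distortion described in the text.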


The only problem that remains to be discussed is the calculation of the clutter-filter impulse
response from Eq. (VI-43). The most straightforward approach is to solve the Wiener-Hopf
equation

$$\sum_{k=-\infty}^{\infty} h(k)\,[R_c(k-j) + \alpha\,\delta(k-j)] = \alpha\,\delta(j), \qquad -\infty < j < \infty \tag{VI-44}$$

If the impulse response is truncated at ±p, the 2p + 1 coefficients can be found by solving
Eq. (VI-44) numerically using the Levinson Recursion. Another approach is to fit an all-pole
spectrum to $G_c(f) + \alpha$ using Linear Prediction techniques and use the spectral coefficients to
determine the clutter filter. For this method, the LPC spectral estimate of $G_c(f) + \alpha$ can be
obtained by solving

$$\sum_{k=1}^{p} a_k\,[R_c(i-k) + \alpha\,\delta(i-k)] = R_c(i), \qquad 1 \le i \le p \tag{VI-45}$$


This equation can be solved efficiently using the Levinson Recursion and results in a p-pole fit
to the clutter spectrum. The estimated spectrum is

$$G_c(z) + \alpha \approx \frac{\sigma^2}{A(z)\,A^*(z)} \tag{VI-46}$$

where

$$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{VI-47}$$

which corresponds to the Inverse Filter in the usual LPC analysis. Substituting Eq. (VI-46) into
Eq. (VI-43) results in the Wiener filter

$$H(z) = \frac{\alpha}{\sigma^2}\,A(z)\,A^*(z) \tag{VI-48}$$

Letting y(n) denote the input sequence and s(n) the output sequence, then

$$S(z) = \frac{\alpha}{\sigma^2}\,A(z)\,A^*(z)\,Y(z) = \frac{\alpha}{\sigma^2}\,A(z)\,X(z) \tag{VI-49}$$

where

$$X(z) = A^*(z)\,Y(z) \tag{VI-50}$$

Since the LPC coefficients $\{a_k\}$ are real,

$$A^*(z) = 1 - \sum_{k=1}^{p} a_k z^{k} \tag{VI-51}$$

and

$$x(n) = y(n) - \sum_{k=1}^{p} a_k\,y(n+k) \tag{VI-52}$$

$$s(n) = \frac{\alpha}{\sigma^2}\Big[x(n) - \sum_{k=1}^{p} a_k\,x(n-k)\Big] \tag{VI-53}$$


Therefore,  the  unrealizable  Wiener  filter  can  be  implemented  by  the  cascade  combination  of  an 
inverse  filter  that  operates  on  p samples  of  future  data  and  an  inverse  filter  that  operates  on  p 
samples  of  past  data.  A p-sample  buffer  must  therefore  be  available  to  provide  for  the  future 
data.  The  advantage  of  this  approach  is  that  the  length  of  the  impulse  response  is  completely 
determined  on  the  basis  of  the  number  of  poles  required  to  fit  the  clutter  spectrum. 
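
The cascade just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the report's code: the function name is hypothetical, `gain` stands in for the scale factor multiplying A(z)A*(z), and samples outside the frame are taken as zero rather than drawn from a p-sample look-ahead buffer.

```python
def unrealizable_wiener(y, a, gain=1.0):
    """Apply gain * A(z) A*(z) to y as an anticausal pass followed by a causal pass.

    a[k - 1] holds the predictor coefficient a_k for k = 1..p.
    Samples outside the frame are treated as zero (an assumption of this sketch).
    """
    n_samples, p = len(y), len(a)

    def at(seq, n):
        return seq[n] if 0 <= n < n_samples else 0.0

    # Inverse filter on p future samples: x(n) = y(n) - sum_k a_k y(n + k)
    x = [y[n] - sum(a[k - 1] * at(y, n + k) for k in range(1, p + 1))
         for n in range(n_samples)]
    # Inverse filter on p past samples: s(n) = gain * [x(n) - sum_k a_k x(n - k)]
    return [gain * (x[n] - sum(a[k - 1] * at(x, n - k) for k in range(1, p + 1)))
            for n in range(n_samples)]


# Response to a unit impulse with a single coefficient a_1 = 0.5.
impulse_response = unrealizable_wiener([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], [0.5])
```

The impulse response comes out symmetric about the input sample, so the cascade is zero-phase and its length is set entirely by the number of poles p, as the text notes.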


H.  EXPERIMENTAL  RESULTS 

The signal-processing concepts developed in the previous sections were evaluated experimentally
using speech data that were corrupted by Airborne Command Post (ACP) noise. Not
only does this provide a good pedagogical tool for illustrating the filtering ideas, but it represents
an important real-world speech-encoding environment which is not adequately solved using
state-of-the-art vocoder technology.

The noisy-speech data were sampled every 132 μsec (7575 Hz), and 158 samples were collected
to define a 20-msec frame. Figure VI-6(a) illustrates a 20-msec sample function of
ACP noise; Fig. VI-6(f) is a plot of the magnitude of its Fourier transform measured in decibels.
The correlation function of the mth frame (i.e., the current frame) of noise data was computed
from

$$R_y(k;m) = \sum_{n=0}^{N-1-k} x(n)\,x(n+k), \qquad k = 0, 1, \ldots, p; \quad m = 1, 2, \ldots \tag{VI-54}$$


where x(n) is the Hamming-weighted version of the input data y(n). A first-order smoothed
correlation function was then computed from

$$R_c(k;m) \propto R_y(k;m) + \gamma\,R_c(k;m-1) \tag{VI-55}$$

In general, the weighting constant $\gamma$ should be chosen to reflect the quasi-stationarity of the
noise random process. For ACP noise, $\gamma = 0.95$ was chosen arbitrarily and seemed to produce
good results.
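
The frame correlation of Eq. (VI-54) and its first-order smoothing can be sketched as follows. The sketch is hedged in two places: the function names are hypothetical, and the (1 - gamma) weight on the new frame is an assumed normalization that gives the recursion unity gain for a stationary input; the report's own scale factor may differ.

```python
import math


def hamming_autocorrelation(frame, p):
    """R_y(k), k = 0..p, of the Hamming-weighted frame, following Eq. (VI-54)."""
    n = len(frame)
    x = [(0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))) * frame[i]
         for i in range(n)]
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(p + 1)]


def smooth_correlation(r_frame, r_prev, gamma=0.95):
    """First-order smoothing across frames; the (1 - gamma) weight is an assumption."""
    return [(1.0 - gamma) * rf + gamma * rp for rf, rp in zip(r_frame, r_prev)]


# Example: one 158-sample frame of a synthetic tone, smoothed against itself.
frame = [math.sin(2.0 * math.pi * 5.0 * i / 158.0) for i in range(158)]
r_y = hamming_autocorrelation(frame, p=10)
r_c = smooth_correlation(r_y, r_y)
```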

From Eq. (VI-38), the clutter filter in the reference channel was given by

$$H_c(z;m) = \frac{\alpha_c(m)}{G_c(z) + \alpha_c(m)} \tag{VI-56}$$

The impulse response was found using Linear Prediction techniques as described in the previous
section. This necessitates solving the Wiener-Hopf prediction equation

$$\sum_{k=1}^{p} a_k\,[R_c(j-k;m) + \alpha_c(m)\,\delta(j-k)] = R_c(j;m), \qquad 1 \le j \le p \tag{VI-57}$$

using the long-term averaged correlation function computed at the last frame (i.e., the mth frame).
A whole class of clutter filters can be obtained simply by varying the parameter $\alpha_c(m)$. Typical
transfer functions from this class are shown in Fig. VI-6(g) for three values of $\alpha_c(m)$. It was found
that the clutter filter defined for the value $\alpha_c(m) = R_c(0;m)$ worked well for ACP noise. For
other noise types, other values would probably be more appropriate. A little experimentation
is therefore required to tune the clutter filter to different noise processes.
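
A minimal sketch of the Levinson Recursion applied to the prediction equation for the clutter spectrum is given below. It assumes the standard Levinson-Durbin form for a symmetric Toeplitz system and treats the parameter alpha as an addition to the zero-lag correlation term only; the function names are hypothetical and no guard against a zero prediction error is included.

```python
def levinson(r, p):
    """Solve sum_k a_k r(|i - k|) = r(i) for i = 1..p.

    Returns (coefficients a_1..a_p, final prediction-error power).
    """
    a = [0.0] * (p + 1)  # a[0] is unused so that a[k] matches a_k
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / err
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        err *= 1.0 - reflection * reflection
    return a[1:], err


def clutter_predictor(r_c, alpha, p):
    """LPC fit to G_c + alpha: alpha is added to the zero-lag correlation only."""
    r = [r_c[0] + alpha] + list(r_c[1:p + 1])
    return levinson(r, p)


# Example with a hypothetical clutter correlation and alpha set to R_c(0).
r_c = [1.0, 0.9, 0.81]
coefficients, error_power = clutter_predictor(r_c, alpha=r_c[0], p=2)
```

Raising alpha flattens the fitted spectrum, so the predictor coefficients shrink and the resulting notch becomes shallower; this is one way to see how varying the parameter generates the class of clutter filters mentioned above.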

The unvoiced- and voiced-speech channels are preceded by another clutter-rejection filter
given by Eq. (VI-43), namely,

$$H_s(z;m) = \frac{\alpha_s(m)}{G_c(z) + \alpha_s(m)} \tag{VI-58}$$

where $\alpha_s$ is chosen to be proportional to the speech-to-noise ratio measured for the current
frame of data (i.e., the mth frame). Since $R_y(0;m)$ represents a measure of the speech-plus-
noise energy for the current frame of data, and since $R_c(0;m)$ represents a measure of the
long-term averaged noise energy, then a reasonable estimate for the speech-to-noise energy is

$$\xi(m) = R_y(0;m) - R_c(0;m) \tag{VI-59}$$


It is possible that the energy in any one 20-msec sample function will be less than the average
clutter energy, especially if that sample function contains noise alone or noise plus unvoiced
speech. Therefore, provision must be made to bound the clutter-notch parameter $\alpha_s$ away from
zero. A reasonable scheme is to pick

$$\alpha_s(m) = \max\,[\xi(m),\ \alpha_c(m)] \tag{VI-60}$$

which guarantees that the speech-clutter-filter notch will never be deeper than that in the
reference channel. As before, the impulse response was found using the Linear Prediction power
spectrum which was obtained by solving the Wiener-Hopf predictor equation (VI-57) using
$\alpha_s$ instead of $\alpha_c$.
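
The selection rule for the speech-channel notch parameter amounts to a subtraction followed by a floor. The sketch below uses a hypothetical function name and made-up energies; it only illustrates how the floor at alpha_c takes over when a frame holds less energy than the averaged clutter.

```python
def speech_notch_parameter(r_y0, r_c0, alpha_c):
    """alpha_s(m) = max[R_y(0;m) - R_c(0;m), alpha_c(m)], per Eqs. (VI-59) and (VI-60)."""
    excess_energy = r_y0 - r_c0  # frame energy above the long-term clutter energy
    return max(excess_energy, alpha_c)


# Hypothetical energies, with alpha_c taken equal to the averaged clutter energy.
strong_speech = speech_notch_parameter(r_y0=10.0, r_c0=1.0, alpha_c=1.0)
noise_only = speech_notch_parameter(r_y0=0.8, r_c0=1.0, alpha_c=1.0)
```

For the strong-speech frame the parameter follows the measured excess energy, giving a shallow notch; for the noise-only frame the difference is negative and the parameter falls back to alpha_c.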

The output of the speech clutter filter was then used as the input to the high- and low-pass
filters characterizing the unvoiced- and voiced-speech processing channels, respectively. The
filters were both 21-tap linear-phase digital filters designed using the Parks-McClellan
algorithm.¹⁵ The impulse responses and frequency characteristics are specified in the Conclusions
of this section (Table VI-2 and Figs. VI-11 and VI-12). No attempt was made to optimize the filter
design. The outputs of the reference-channel clutter filter z(n) and the unvoiced- and
voiced-speech filters u(n), v(n) are shown in Figs. VI-6(b) through (d). According to Eq. (VI-34),
the outputs of the speech filters were then correlated with the output of the reference-channel
clutter filter to form the detection statistics:

$$\ell_1(m) = \sum_{n=1}^{N} z(n)\,u(n) \tag{VI-61a}$$

$$\ell_2(m) = \sum_{n=1}^{N} z(n)\,v(n) \tag{VI-61b}$$

$$\ell_3(m) = \ell_1(m) - \ell_2(m) \tag{VI-61c}$$

It should be noted that the comb filter has been left out of the voiced-speech processing channel.
This decision was made to show that good classifier performance could be obtained without
having to make a pitch estimate. Omitting the pitch estimate simplifies the classifier processing,
which is necessary for some applications.

The  detection  thresholds  were  obtained  by  driving  the  system  with  ACP  noise  for  15  data 
frames  (0.3  sec).  This  is  the  only  training  cycle  required  by  the  processor,  and  should  be  rel- 
atively easy  to  meet  in  practice  because  there  is  always  a speech-free  interval  before  a talker 
actually  speaks  into  the  encoding  device  after  having  turned  the  machine  on.  Averaged  detection 
statistics  for  the  training  noise  are  computed  from 

$$\bar{\ell}_i(m) \propto \ell_i(m) + \gamma\,\bar{\ell}_i(m-1), \qquad i = 1, 2, 3 \tag{VI-62}$$

with $\gamma = 0.95$ as before. The detection thresholds were then chosen to be

$$\lambda_i(m) = 1.5\,\bar{\ell}_i(m), \quad i = 1, 2; \qquad \lambda_3(m) = \bar{\ell}_3(m) \tag{VI-63}$$

which allows for moderate statistical fluctuations. After the first 15 data frames of noise have
been processed (m = 15) and the initial threshold setting computed, the classification process
is initiated. The next frame of data is processed and the detection statistics $\ell_i(m+1)$ are
computed. If $\ell_1(m+1) < \lambda_1(m)$ and $\ell_2(m+1) < \lambda_2(m)$, then the data are classified as silence and
the clutter correlation function (VI-55) and the detection thresholds (VI-62) and (VI-63) are
updated. If $\ell_1(m+1) > \lambda_1(m)$ or $\ell_2(m+1) > \lambda_2(m)$, then speech is declared present and neither
the clutter correlation function nor the detection thresholds is changed. No updating is done
until the next frame of silence is detected. This procedure allows the classifier to track noise
processes whose statistics vary slowly with time. Such a classifier structure is often referred
to as a decision-directed detector since it tells itself when to alter its structure. It becomes
evident, therefore, that the detection thresholds should be set low even at the expense of a high
false-alarm rate (declaring noise as speech is a false alarm). It would be a more serious error
if the classifier declared speech as noise since, then, all the clutter filters and detection
thresholds would be tuned to reject speech. Fortunately, this malign event rarely occurred for ACP
noise, and when it did the noise always completely overpowered the speech so that little change
in the filter structures occurred.
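
The decision-directed update can be sketched as a single per-frame step. This is an illustration only: the function name and argument layout are hypothetical, the (1 - gamma) weight in the running average is an assumed normalization, and the voiced/unvoiced choice made after speech is declared is left out because it depends on the rule derived earlier in the report.

```python
def classify_and_update(stats, averages, thresholds, gamma=0.95):
    """One decision-directed step of the silence/speech test.

    stats      -- the three detection statistics for the new frame
    averages   -- running averages of the statistics over silence frames
    thresholds -- thresholds in force before this frame
    Returns (is_silence, averages, thresholds). State changes only on silence.
    """
    is_silence = stats[0] < thresholds[0] and stats[1] < thresholds[1]
    if is_silence:
        # Assumed normalization: weight (1 - gamma) on the new frame.
        averages = tuple((1.0 - gamma) * s + gamma * a
                         for s, a in zip(stats, averages))
        thresholds = (1.5 * averages[0], 1.5 * averages[1], averages[2])
    return is_silence, averages, thresholds


# A quiet frame updates the state; a loud frame leaves it untouched.
state = ((1.0, 1.0, 0.0), (1.5, 1.5, 0.0))
quiet, avg_q, thr_q = classify_and_update((1.2, 0.9, 0.3), state[0], state[1])
loud, avg_l, thr_l = classify_and_update((5.0, 0.9, 4.1), avg_q, thr_q)
```

In a full implementation the smoothed clutter correlation would be refreshed in the same branch that refreshes the thresholds, so that the clutter filters and the thresholds always describe the same noise estimate.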

The effects of the three filtering channels on the three speech types will be examined for
some typical cases to develop a feeling for the classifier operation. Figure VI-6(a) is a plot of
a 20-msec input sample function of ACP noise; Fig. VI-6(f) is the corresponding short-term
power spectrum. Figure VI-6(g) is a plot of the adaptive clutter-filter transfer function in the
reference channel (the adaptive prefilter). For ACP noise input, it has adapted in such a way
as to make a −10-dB null at the clutter frequencies. Figures VI-6(b) through (d) show the
respective outputs of the reference channel, the high-pass filtered unvoiced-speech channel, and
the low-pass filtered voiced-speech channel.

As  was  described  in  the  previous  section,  the  output  of  the  speech-channel  clutter  filter 
represents  a minimum  mean-squared-error  estimate  of  the  input  speech.  Figure  VI-6(e)  shows 
a plot  of  the  prefilter  output  in  response  to  ACP  noise  at  the  input.  Of  course,  with  high  prob- 
ability the  classifier  will  classify  the  frame  as  silence;  hence,  one  has  the  option  of  setting  the 
prefilter  output  to  zero,  which  removes  the  residual  noise  completely. 

Although  the  comb-filter  discriminator  was  not  used  in  the  classifier,  it  remains  of  interest 
to  evaluate  the  robustness  of  the  maximum-likelihood  pitch  estimator  in  ACP  noise.  This  was 
done  by  applying  the  output  of  the  low-pass  filter  v(n)  to  a bank  of  two-pulse  comb  filters  cover- 
ing the  range  from  70  to  300  Hz.  Figure  VI-6(h)  is  a plot  of  the  energy  at  the  output  of  the  comb 
filters  as  a function  of  the  pitch  period  for  the  ACP  noise  sample. 
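
The comb-filter energy contour can be sketched as follows. The sketch assumes that a two-pulse comb filter is y(n) = v(n) + v(n - T) and that the output energy is averaged over the samples available at each lag; the function name, the averaging, and the synthetic test signal are assumptions made for illustration.

```python
import math


def comb_bank_energy(v, sample_rate=7575.0, f_low=70.0, f_high=300.0):
    """Mean output energy of y(n) = v(n) + v(n - T) for each pitch period T in samples."""
    t_min = int(sample_rate / f_high)
    t_max = int(sample_rate / f_low)
    energy = {}
    for t in range(t_min, t_max + 1):
        total = sum((v[n] + v[n - t]) ** 2 for n in range(t, len(v)))
        energy[t] = total / (len(v) - t)  # per-sample average (an assumption)
    return energy


# Synthetic "voiced" signal with a 50-sample period (about 151 Hz at 7575 Hz).
v = [math.cos(2.0 * math.pi * n / 50.0) + 0.5 * math.cos(4.0 * math.pi * n / 50.0)
     for n in range(316)]
contour = comb_bank_energy(v)
```

The contour is large at the true period and small at lags where the delayed copy cancels the fundamental. Integer multiples of the true period score nearly as high as the period itself, so a practical estimator built on this contour needs some rule for choosing among them.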

The  same  sequence  of  data  is  plotted  in  Figs.  VI-7(a)  through  (h)  and  VI-8(a)  through  (h)  for 
20-msec  frames  of  unvoiced  and  voiced  speech,  respectively.  Figures  VI-7  (a)  and  (f)  show  that 
the unvoiced-speech-to-noise ratio is less than 0 dB (it is roughly −3 dB), yet Fig. VI-7(e) shows
that  the  prefilter  has  removed  a significant  portion  of  the  clutter  waveform  while  allowing  the 
unvoiced-speech  waveform  to  pass  relatively  undisturbed.  Figures  VI-8(a)  and  (f)  show  that  the 
voiced-speech-to-noise  ratio  is  quite  large  (it  is  roughly  9 dB).  Figure  VI-8(g)  shows  that  the 
prefilter  transfer  function  is  adjusted  to  allow  most  of  the  speech  to  pass,  even  though  its  spec- 
trum overlaps  that  of  the  ACP  noise.  This  shows  the  advantage  of  the  adaptive  prefilter.  Had 
a fixed  clutter  filter  been  used,  the  voiced-speech  waveform  would  have  been  distorted  unnec- 
essarily. Figure  VI-8(h)  shows  that  the  pitch  estimate  is  perturbed  very  little  by  the  presence 
of ACP noise. In general, the only significant pitch errors found were the effects of pitch
doubling, which occurred intermittently near the ends of a voiced sound. Figure VI-8(e) shows
how the prefilter attempts to reproduce the voiced-speech waveform.


Fig. VI-8(a). Voiced-speech sample function.
Fig. VI-8(b). Reference channel output.
Fig. VI-8(c). Unvoiced-speech channel output.
Fig. VI-8(d). Voiced-speech channel output.
Fig. VI-8(e). Prefilter output.
Fig. VI-8(f). Voiced-speech power spectrum.
Fig. VI-8(g). Prefilter frequency response.
Fig. VI-8(h). Comb-filter response.
[Panels (a) through (e) are plotted against time, (f) and (g) against frequency, and (h) against pitch period.]


TABLE VI-1
CLASSIFIER PERFORMANCE STATISTICS

                             Estimated
True                Silence   Unvoiced   Voiced
Silence               405        14        24
Unvoiced                4        43         2
Voiced                  1         5       170
Unvoiced-Voiced         0         0         6


Fig. VI-9. Unvoiced-speech prefilter output spectra (plotted against frequency).

Fig. VI-10. Voiced-speech prefilter output spectra (plotted against frequency).

Having  established  the  basic  characteristics  of  the  classifier,  our  next  step  is  to  evaluate 
the  frame-to-frame  performance  when  an  ACP  noise-corrupted  utterance  is  applied  to  the  input. 
Classification  errors  were  obtained  by  determining  the  true  speech  type  by  visually  examining 
the  waveform,  power  spectrum,  and  comb-filter  energy  contour  for  each  20-msec  sample  func- 
tion. Statistics  were  accumulated  for  a total  of  three  utterances  spoken  by  three  male  speakers 
in different ACP noise environments. The results, from which the false-alarm probability
(declare speech given silence) is estimated to be 9.4 percent, are tabulated in Table VI-1. The
miss probability (declare silence given speech) is 2.3 percent. The misses mainly occurred
for unvoiced speech that had been completely overpowered by the noise (−10-dB speech-to-noise
ratio). Erroneous classifications (voiced ↔ unvoiced) occurred at the rate of 3 percent.
Whenever a frame represented a mixture of voiced and unvoiced speech, the classifier always
chose in favor of voiced speech. This event could be reduced significantly by reducing the frame
period (10 vs 20 msec). Although these statistics have been gathered for a relatively small
ensemble, the general impression is that the performance is quite good.

Another aspect of the experimental program was the recovery and synthesis of noise-corrupted
speech using Linear Prediction techniques. The voiced-unvoiced decisions and the
pitch estimates were derived using the methods described here. The LPC filter coefficients
were estimated from the prefilter output waveform. For the case of noise-corrupted unvoiced
speech, Fig. VI-7(a) for example, the prefilter output is shown in Fig. VI-7(e). Its short-term
power spectrum is shown in Fig. VI-9 which, when compared with that for the input unvoiced
speech plus ACP noise, Fig. VI-7(f), clearly demonstrates the action of the adaptive prefilter
in  eliminating  the  clutter.  The  LPC  power-spectrum  estimate  is  also  plotted  in  Fig.  VI-9  and 
shows  that  the  synthetic  speech  is  likely  to  reproduce  the  original  unvoiced  speech.  Of  course, 
the  ACP  noise  will  cause  the  spectral  estimate  to  be  somewhat  distorted,  but  the  perception  of 
the  additive  ACP  noise  will  have  disappeared.  It  is  for  this  reason  that  the  synthetic  speech  is 
perceived  to  be  "noise-free." 

Similar results are obtained for the voiced-speech sample function shown in Fig. VI-8(a).
The short-term power spectrum of the prefilter output, Fig. VI-8(e), is plotted in Fig. VI-10
and should be compared with the voiced speech plus noise power spectrum shown in Fig. VI-8(f).
The corresponding LPC spectrum shown in Fig. VI-10 shows the distortion in the first formant
due to the presence of the ACP noise.

LPC  synthetic  speech  was  generated  for  a number  of  utterances  recorded  in  ACP  noise. 
Compared  with  LPC  speech  in  which  no  adaptive  prefiltering  was  employed,  an  improvement  in 
intelligibility  was  obtained. 


I.  CONCLUSIONS 

Using  statistical  decision  theory,  a new  speech-classification  algorithm  has  been  developed 
in  the  form  of  an  estimator-correlator  receiver.  The  structure  is  robust  in  the  sense  that  it 
can  adapt  to  time-varying  noise  fields  in  which  the  SNR  can  be  quite  low  (less  than  10  dB).  For 
noiseless speech, the classifier simply involves two fixed filters and requires no pitch
estimation or linear-prediction-analysis parameters. For noisy speech, clutter filters must be added
to the speech and reference channels. The reference clutter filter is developed on the basis of
an initial 0.3-sec sample of noise data, while the other adapts to the speech-plus-noise statistics
calculated for each frame. If a frame is classified as noise, the reference-channel filter is
updated so that time-varying noise statistics can be tracked.


TABLE VI-2
VOICED- AND UNVOICED-FILTER IMPULSE RESPONSES

Impulse Response    Unvoiced Filter      Voiced Filter
h(1)                -0.21511067E-01      -0.38655568E-02
h(2)                 0.55939741E-02      -0.32053679E-01
h(3)                 0.21661893E-01       0.23418449E-01
h(4)                 0.39310634E-01       0.13665602E-01
h(5)                 0.45899481E-01      -0.42199165E-01
h(6)                 0.29383000E-01       0.73566064E-02
h(7)                -0.15331455E-01       0.66053927E-01
h(8)                -0.82191288E-01      -0.65457523E-01
h(9)                -0.15448785E+00      -0.84543467E-01
h(10)               -0.21035391E+00       0.30347985E+00
h(11)                0.76869851E+00       0.59147525E+00


Fig. VI-11. Unvoiced high-pass filter (linear magnitude).

Fig. VI-12. Voiced low-pass filter (linear magnitude).


The output of the speech-channel clutter filter represents an improved estimate of the input
speech in the sense that much of the additive noise has been canceled from the signal. By applying
Linear Prediction techniques to this waveform, more intelligible synthetic speech can be
obtained.

A relatively thorough (non-real-time) evaluation of the classifier and adaptive prefilter was
conducted for ACP noise, and surprisingly good results were obtained. Based on a limited number
of listening tests, the LPC synthetic speech using the prefilter output was found to be more
intelligible than the LPC synthesis of the original noisy speech.

No attempt was made to optimize the design of the fixed voiced (low-pass) and unvoiced
(high-pass) filters. The unvoiced- and voiced-speech Wiener filters were approximated by
21-tap linear-phase high- and low-pass filters designed using the Parks-McClellan algorithm.
Impulse responses used in the experimental program are given in Table VI-2 [h(n) = h(-n)].
The magnitude of the frequency responses is shown in Figs. VI-11 and VI-12. A better approach
would  be  to  obtain  long-term  statistics  for  voiced  and  unvoiced  speech  and  pick  the  filter  length 
and  passband  edges  to  more  closely  represent  the  average  spectral  properties.  Another  useful 
study  would  be  to  investigate  the  possibility  of  using  recursive  filters  with  phase  compensation 
to  further  simplify  the  processing. 
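
The frequency responses of the two fixed filters can be recomputed from the tabulated half responses. The sketch below assumes that h(11) is the center tap and that the remaining ten taps mirror h(10) through h(1), which is how the note h(n) = h(-n) is read here for a 21-tap filter; the coefficient values are those of Table VI-2 as they could be read from the scan, and the function names are hypothetical.

```python
import cmath
import math

# Half impulse responses h(1)..h(11) as read from Table VI-2.
UNVOICED_HALF = [-0.21511067e-01, 0.55939741e-02, 0.21661893e-01, 0.39310634e-01,
                 0.45899481e-01, 0.29383000e-01, -0.15331455e-01, -0.82191288e-01,
                 -0.15448785e+00, -0.21035391e+00, 0.76869851e+00]
VOICED_HALF = [-0.38655568e-02, -0.32053679e-01, 0.23418449e-01, 0.13665602e-01,
               -0.42199165e-01, 0.73566064e-02, 0.66053927e-01, -0.65457523e-01,
               -0.84543467e-01, 0.30347985e+00, 0.59147525e+00]


def full_impulse_response(half):
    """Mirror h(1)..h(10) about the assumed center tap h(11) to obtain 21 taps."""
    return half + half[-2::-1]


def magnitude(taps, freq_hz, sample_rate=7575.0):
    """Magnitude of the FIR frequency response at freq_hz."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    return abs(sum(h * cmath.exp(-1j * w * n) for n, h in enumerate(taps)))


unvoiced_taps = full_impulse_response(UNVOICED_HALF)
voiced_taps = full_impulse_response(VOICED_HALF)
nyquist = 7575.0 / 2.0
gains = {
    "voiced_dc": magnitude(voiced_taps, 0.0),
    "voiced_nyquist": magnitude(voiced_taps, nyquist),
    "unvoiced_dc": magnitude(unvoiced_taps, 0.0),
    "unvoiced_nyquist": magnitude(unvoiced_taps, nyquist),
}
```

Evaluating `magnitude` on a grid of frequencies between 0 and the Nyquist frequency reproduces curves of the kind plotted in Figs. VI-11 and VI-12, and gives a quick way to compare alternative filter lengths or band edges.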

Although  a first-order  attempt  was  made  to  improve  the  design  of  the  clutter  filters,  other 
methods  are  undoubtedly  possible.  Additional  insights  are  also  needed  in  the  selection  of  the 
clutter-filter  design  parameter;  in  this  report,  trial  and  error  were  used  to  make  the  selection. 

Of  course,  the  real  test  of  any  speech-processing  algorithm  is  obtained  in  a real-time  en- 
vironment. This  is  the  focus  of  the  current  effort. 




REFERENCES


1. B. S. Atal and S. L. Hanauer, "Speech Analysis and Synthesis by Linear
Prediction of the Speech Wave," J. Acoust. Soc. Am. 50, 637 (1971).

2. J. Makhoul, "Linear Prediction: A Tutorial Review," Proc. IEEE 63,
561 (1975).

3. J. D. Markel and A. H. Gray, Jr., "A Linear Prediction Vocoder Simulation
Based Upon the Autocorrelation Method," IEEE Trans. Acoust.,
Speech, and Signal Processing ASSP-22, 124 (1974).

4. B. S. Atal and M. R. Schroeder, "Adaptive Predictive Coding of Speech
Signals," Bell Syst. Tech. J. 49, 1973 (1970).

5. B. Gold, "Robust Speech Processing," Technical Note 1976-6, Lincoln
Laboratory, M.I.T. (27 January 1976), DDC AD-A021899/0.

6. B. S. Atal and L. R. Rabiner, "A Pattern-Recognition Approach to Voiced-
Unvoiced-Silence Classification with Applications to Speech Recognition,"
IEEE Trans. Acoust., Speech, and Signal Processing ASSP-24, 201
(1976).

7. H. L. Van Trees, Detection, Estimation and Modulation Theory, Part III
(Wiley, New York, 1968).

8. Ibid., Part I (Wiley, New York, 1968).

9. N. Levinson, "The Wiener RMS Error Criterion in Filter Design and
Prediction," J. Math. Phys. 25, 261 (1947).

10. J. D. Wise, J. R. Caprio, and T. W. Parks, "Maximum Likelihood Pitch
Detection," Technical Report No. 7516, Department of Electrical
Engineering, Rice University (17 October 1975).

11. R. J. McAulay, "Optimum Classification of Voiced Speech, Unvoiced Speech
and Silence in the Presence of Noise and Interference," Technical Note
1976-7, Lincoln Laboratory, M.I.T. (3 June 1976), DDC AD-A028518/9.

12. J. A. Moorer, "The Optimum Comb Method of Pitch Period Analysis of
Continuous Digitized Speech," IEEE Trans. Acoust., Speech, and Signal
Processing ASSP-22, 330 (1974).

13. M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley,
"Average Magnitude Difference Function Pitch Extractor," IEEE Trans.
Acoust., Speech, and Signal Processing ASSP-22, 353 (1974).

14. R. J. McAulay and R. D. Yates, "Realization of the Gauss-in-Gauss
Detector Using Minimum-Mean-Squared-Error Filters," IEEE Trans.
Inform. Theory IT-17, 207 (1971), DDC AD-729596.

15. J. H. McClellan, T. W. Parks, and L. R. Rabiner, "A Computer Program
for Designing Optimum FIR Linear Phase Digital Filters," IEEE Trans.
Audio Electroacoust. AU-21, 506 (1973).


VII. TANDEMING AND IMPROVEMENT OF HIGH-RATE CODERS


Investigations of improved wideband speech digitizers (CVSD and APC) and improved tandem
arrangements (CVSD-CVSD, LPC-CVSD) required the design of two distinct software systems
on the Group 24 Univac-FDP facility.

For comparison of standard CVSD and modified CVSD algorithms, a design program was
written for the FDP. This design program allowed two independent CVSD encoders to run on
the same audio input samples with the parameters of one of the encoders easily modified. Both
encoders drove a routine which computed an average mean-square error between the input and
reconstructed speech waveforms. In this way, parameter changes in the CVSD algorithm could
be evaluated by comparing the resulting mean-squared speech error with a "standard" CVSD
algorithm error for a given audio input. The output reconstructed speech from both modified
and unmodified algorithms could be played out of the FDP digital-to-analog (D/A) converter for
real-time listening to the two outputs. Finally, the input waveform and both reconstructed
waveforms could also be displayed for comparison and quantitative evaluation. Another program
was written to deal with the wideband tandeming situation. This program could tandem up to
four CVSD encoders with high-order elliptic low-pass filters between encoders to smooth
intermediate waveforms and destroy synchrony between encoders. This was an accurate simulation
of tandemed real-world CVSD units. In use, the first program described was used to converge
to an "optimum" set of parameters (i.e., two time constants, and upper and lower slope changes);
then the second program was used to evaluate the "optimum" encoder in a tandem environment.

CVSD  parameter  adjustment  at  16  kbps  was  studied  using  this  FDP-Univac  facility.  An 
audio  tape  was  generated  for  several  conditions  of  slope  clamping  and  slope  time  constant  as 
well  as  four  times  tandem.  The  following  points  can  be  made  to  summarize  the  study  and  the 
tape.  First,  there  appear  to  be  no  "optimum"  settings  for  the  high-  and  low-slope  clamps  and 
the slope time constant, in the sense that large variations in these parameters (for example,
10:1 in the case of time constant) affect the output speech just perceptibly, and probably do not
affect intelligibility at all. The upper clamp level interacts with the time constant to the degree
that recovery to wide-dynamic-range changes is affected by a large time constant and a large
upper  clamp.  Again,  this  is  noticeable  only  with  large  changes  in  parameters.  Probably  the 
most  noticeable  effect  of  parameter  changing  occurs  with  the  lower  clamp,  although  intelligi- 
bility is  degraded  only  in  tandem  use.  With  the  upper  clamp  set  at  some  reasonable  value,  and 
a time  constant  set  at  10  msec,  the  lower  value  can  be  varied  from  0 (no  lower  clamp)  to  some- 
thing like  40  dB  below  the  upper  clamp,  with  no  severe  change  in  the  output  speech.  What  does 
change  is  the  noise  in  speech  silences.  With  a lower  clamp  of  zero,  all  input  noise  is  encoded 
producing  noise  output  from  the  receiver.  When  the  lower  clamp  is  moved  up,  the  noise  in 
silences  is  suppressed  at  a reduction  in  dynamic  range.  From  the  point  of  view  of  tandem  op- 
eration, a lower  clamp  close  to  zero  allows  for  minimum  loss  of  speech  sounds  (minimum  loss 
of  dynamic  range)  at  a cost  of  some  noise  in  silent  intervals.  This  study  indicates  that  CVSD 
and  similar  devices  produce  a speech  SNR  reasonably  close  to  what  this  class  of  devices  is 
capable  of  producing. 

The  second  distinct  software  system  was  designed  to  investigate  tandem  interactions  be- 
tween wideband  and  narrowband  terminals.  The  system  allows  for  the  storage  of  a raw  digitized 
speech  signal  on  the  Univac  drum.  The  waveform  can  then  be  played  out  to  a speech  coder,  and 
the coder output can also be stored on the drum. In its turn, the coder output can be played out




to a second coder, with this final output also stored on the Univac drum. In this manner, a
tandeming  situation  can  be  set  up  in  a reproducible  fashion  with  a fixed-speech  utterance  as  a 
probe. The software and hardware systems as set up allow waveforms to be displayed in flexible
formats, and any of the node outputs to be heard out of the D/A converter. This system was
used primarily to investigate LPC output waveforms into CVSD coders. This tandem combination
was indicated by T&E results to be a particularly severe one. To reduce the peaky LPC
output waveform and, in turn, reduce the CVSD slope overload distortion, the LPC waveform
was synthesized with a modified LPC filter. The modified filter was obtained by shifting all the
filter poles closer to the unit circle in the sampled-data z-plane. This was done with a simple
filter scaling in the iteration, and has the effect of narrowing the bandwidth of all the poles. The
technique was of limited value and introduced severe stability problems which could not be
solved except with increased computational complexity. Study of several all-pass phase-
dispersion filters led to modest improvement in overall tandem performance, from a waveform
point of view, but seemed to make little difference in listening tests. Finally, a class of dispersive
filters derived from the radar "FM chirp" concept appeared to yield some improvement
in  paired  listening  comparisons  of  LPC-CVSD  tandem  situations,  with  and  without  the  dispersion 
filters. 

Design of a (digital) chirp filter can be understood by first examining the mathematical
expression for an analog chirp:

$$h(t) = \sin\left(\frac{\pi W}{T}\,t^{2}\right) \tag{VII-1}$$


where t is time, W is the chirp bandwidth, and T is the duration of the chirp. To show that
W is really the bandwidth, note that $\phi = \pi W t^{2}/T$ is the instantaneous phase, and therefore the
instantaneous frequency is

$$f = \frac{1}{2\pi}\,\frac{d\phi}{dt} = \frac{W}{T}\,t \tag{VII-2}$$


Thus,  when  t = T,  the  frequency  has  chirped  all  the  way  up  to  W.  Intuition  tells  us  that  the 
signal h(t) should have a spectrum that extends from about -W to +W, but this doesn't tell us
enough  about  the  spectral  details. 

Digital chirp filters are not quite as well known, although there has been some discussion
of their properties in a radar context.† A simple way to translate Eq. (VII-1) into a digital signal
is to let $t = nT_s$ (where $T_s$ is the sampling interval) and $T = MT_s$, where M is the number of
samples in the signal; then

$$s(n) = h(nT_s) = \sin\left(\frac{\pi W}{MT_s}\,n^{2}T_s^{2}\right) = \sin\left(\frac{\pi\,(WT_s)\,n^{2}}{M}\right) \tag{VII-3}$$


with the product $WT_s$ defined as the "chirp constant." Our first task therefore is to choose $WT_s$
and M for best results. For speech, W should be about 3.5 kHz, so that the chirp sweeps out
the pertinent audio band. M is the parameter we have to play with; a small M will yield a very
non-flat spectral response and distort the speech spectrum, while a large M tends to be
expensive to implement.
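
As a hedged illustration of Eq. (VII-3), the sketch below generates the unit-sample response for a
given chirp constant and tap count M, and evaluates its magnitude response on a grid from 0 to π.
The function names and the grid size are illustrative assumptions; they are not taken from the
original simulation programs.

```python
import cmath
import math


def chirp_taps(chirp_constant, num_taps):
    """Unit-sample response s(n) = sin(pi * (W*Ts) * n^2 / M) for n = 0..M-1."""
    return [
        math.sin(math.pi * chirp_constant * n * n / num_taps)
        for n in range(num_taps)
    ]


def instantaneous_frequency(n, chirp_constant, num_taps):
    """Sampled form of Eq. (VII-2), in cycles per sample: f*Ts = (W*Ts) * n / M."""
    return chirp_constant * n / num_taps


def magnitude_response(taps, num_points=64):
    """|S(e^{jw})| at num_points + 1 frequencies evenly spaced from 0 to pi."""
    magnitudes = []
    for k in range(num_points + 1):
        omega = math.pi * k / num_points
        total = sum(s * cmath.exp(-1j * omega * n) for n, s in enumerate(taps))
        magnitudes.append(abs(total))
    return magnitudes


if __name__ == "__main__":
    taps = chirp_taps(0.5, 34)  # chirp constant 0.5 and M = 34, as in Fig. VII-1
    for k, magnitude in enumerate(magnitude_response(taps, 16)):
        print(f"w = {k}/16 * pi   |S| = {magnitude:.3f}")
```

Rerunning the script with other values of M gives a direct way to compare how flat the magnitude
response stays across the band as the tap count changes.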


† L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing (Prentice-Hall,
Englewood Cliffs, New Jersey, 1975).




Fig. VII-1. Amplitude and phase characteristics of 34-tap chirp (chirp constant = 0.5).


Figure VII-1 presents a compromise chirp filter in magnitude and phase. The magnitude
function is presented from 0 to π, which represents the range up to half the sampling frequency.
The chirp constant ($WT_s$) of 0.5 represents a chirp right out to half the sampling frequency.
The linear-amplitude scale is down about 3 dB at both high and low ends. The phase display is
that of the appropriate quadratic function for the chirp, but it is difficult to see because of the
±π representation. Figure VII-2 presents the impulse response of a typical chirp filter. For
the given 34-tap filter, the overall delay dispersion from zero frequency up to the half-sampling
frequency is about 4.25 msec.
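
The quoted dispersion can be checked against Eq. (VII-2): the chirp reaches frequency f at time
t = fT/W, so the delay spread between zero frequency and W is the chirp duration T = MT_s. The
sampling interval is not stated in this passage; the value T_s = 125 µsec (8-kHz sampling) used
below is an assumption, chosen because it reproduces the quoted figure.

$$T = M\,T_s = 34 \times 125\ \mu\text{sec} = 4.25\ \text{msec}$$

Under the same assumption, a chirp constant of 0.5 gives W = 0.5/T_s = 4 kHz, which is half the
sampling frequency.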

This 34-tap filter as well as a 46- and a 64-tap version were included in the tandem
demonstration of LPC-chirp filter-CVSD delivered to Reston at the end of FY 7T. The demonstration
allowed for comparative listening to LPC-CVSD tandem with and without the dispersive
filters.




Fig.  VII-2.  Unit  sample  response  of  typical  chirp  filter. 




As a result of our tandeming and wideband studies as described, it appears that an ongoing
research strategy must be directed toward improved quality of both the narrow and wideband
equipments separately. Any attempts to improve their interoperability seem to yield very small
gains not at all comparable to the overall reduction in quality and intelligibility the tandem
combinations produce. Some early work this past year on wideband APC systems produced
higher-quality outputs at 16 kbps than any 16-kbps CVSD devices. However, we must reduce
APC hardware complexity to produce a viable wideband alternative to CVSD.


VIII. MICROPROCESSOR REALIZATION OF A LINEAR-PREDICTIVE VOCODER
A. INTRODUCTION

For the past several years there has been a trend toward the realization of narrowband
speech terminals in the form of small general-purpose digital computers. These computers
have been fast enough to run the "real-time code" necessary to transform them from general-
purpose computers to speech terminals capable of full duplex operation between talker-listener
and modem. This approach was necessitated by the flux in narrowband speech algorithms during
this time. As a result of recent work in linear predictive coding (LPC) techniques¹,² applied
to the analysis-synthesis of speech, it has become possible to specify an LPC approach³
which produces acceptable narrowband speech in the range from 2.4 to 4.8 kbps. In addition, a
recent project at Lincoln Laboratory⁴ provided the opportunity to implement the pertinent LPC
code, pitch-detector code, and data-handling code in a very "lean" manner in terms of program
and data memory use, and efficient real-time operation. This previous experience has enabled
us to approach the design of a microprocessor-based LPC vocoder with full knowledge of each
subroutine and all timing sequences needed for interaction with both the incoming and outgoing
audio data, as well as the outgoing and incoming digital data stream.

Our starting goal for a microprocessor-realized linear-predictive vocoder was the production
of a compact, low-power, inexpensive device using commercially available integrated
circuits. We were willing to design a completely special-purpose device that would implement
only the LPC voice terminal in an efficient form. In addition, there was no consideration of custom
large-scale-integration chip use since the costs for a limited vocoder market appeared too
high, and no small set of chip types seemed adequate. In effect, the goal was a benchmark device
using only commercial chips whose price would drop with the larger commercial market.
This benchmark device could then be used in larger system designs as a cheap building block,
or could be modified and expanded to include modem and other functions.

Starting with a study of available microprocessor chip sets, a particular choice was made
on the basis of speed, signal-processing power, and basic chip organization (the AMD 2900 series).
Several design iterations were then made starting with a machine using three separate
microprocessor CPEs. In this design each CPE was doing a special-purpose task, and was fed
from separate analog-processing circuits. Because of inefficiencies associated with memory
sharing and access, this design evolved to a two-CPE machine which was physically divided into
a transmitter and separate receiver. This design also appeared inefficient. Finally, it was
seen that a single CPE and hardware multiplier could satisfy all the signal-processing requirements
for the given algorithms. A complete software study then preceded the detailed logic design.
In effect, all the machine code was written or blocked out to verify the design. In spite
of our avowed goal of a special-purpose vocoder device, in the end we designed a rather general-
purpose structure. The limited in-out capability as well as the limited data and program memory
are what remain of the special-purpose device. The end design is based on a single microprocessor
CPE augmented with a four-cycle multiplier. The basic structure is that of a two-bus
general-purpose machine with separate program and data memory as shown in Fig. VIII-1.




Fig. VIII-1. LPCM block diagram.


Fig. VIII-2. CPE chip block diagram.




B. LPCM SYSTEM DESCRIPTION


1.  Architecture 

The basic block diagram for the LPCM is shown in Fig. VIII-1. All instructions for this
machine are executed in a 150-nsec cycle except the multiply which requires four machine cycles
or 600 nsec. The nucleus of this system is the CPE which is based on the AMD 2901 microprocessor
chip. Four such chips are used along with a carry-lookahead chip to yield a 16-bit CPE.
A simplified block diagram of the 2901 appears in Fig. VIII-2. From this diagram it can be
seen that the chip consists of an ALU capable of add, subtract, and Boolean operations coupled
with an internal 2-port general register file consisting of 16 words. Multiplexers at the input
of this register file permit a 1-bit up-or-down shift prior to writing the memory. A Q-register
is provided which allows double-precision shifts to be implemented. Inputs to the chip from the
outside world consist of two 4-bit addresses for the internal register file, control signals, and
data from external devices such as memory or I/O devices. The manufacturer's literature
should be consulted for further details about the 2901.

Referring again to Fig. VIII-1, it is seen that the 16-bit CPE is connected to an input and an
output data line. The input line is multiplexed between 6 data sources: the 16-bit memory output
register (MOR) of the data memory, the 12-bit A/D converter, the ti-bit serial-to-parallel
(S/P) converter, the 16-bit upper and lower products coming from the multiplier, and an 11-bit
field coming from the instruction register. The data memory consists of 2K 16-bit words,
1.5K of which are ROM and contain the various lookup tables needed to implement the LPC algorithm.
The output of the CPE is channeled to the D/A converter, the parallel-to-serial (P/S)
converter, the memory buffer and address registers (MBR and MAR), and the multiplicand
(MCD) and multiplier (MPR) registers of the multiplier. These various output registers are
clocked under the control of a 3-bit field in the instruction register.

The multiplier uses the Booth-McSorley algorithm to multiply two 16-bit two's-complement
numbers and makes the full 32-bit product available to the CPE's input ports in two 16-bit pieces.
The multiplier is fabricated from the AMD 25S05 4 × 2 multiplier chip. Eight of these are used
to construct a 16 × 4 array multiplier which is clocked four times to yield the final product.
The outputs are fully buffered so that the product may be retrieved from the multiplier any time
four machine cycles or longer after the start of the multiply. The CPE is free to do other tasks
in this interval while multiplication is taking place.
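
The four-clock multiply can be illustrated with a small behavioral model. The sketch below uses
radix-4 Booth recoding and retires four multiplier bits (two Booth digits) per clock, returning
the 32-bit product as upper and lower 16-bit words. It is an assumption-laden illustration: the
function names are hypothetical, and the internal recoding of the actual multiplier chips may
differ from this model.

```python
def to_signed16(word):
    """Interpret the low 16 bits of word as a two's-complement number."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def booth_digits(multiplier):
    """Radix-4 Booth digits (each in -2..2), least significant first."""
    m = multiplier & 0xFFFF
    digits = []
    previous_bit = 0  # implicit bit to the right of bit 0
    for i in range(8):
        low_bit = (m >> (2 * i)) & 1
        high_bit = (m >> (2 * i + 1)) & 1
        digits.append(-2 * high_bit + low_bit + previous_bit)
        previous_bit = high_bit
    return digits


def multiply_four_clocks(mcd, mpr):
    """Multiply two 16-bit two's-complement words in four 4-bit steps.

    Returns (upper, lower), the two 16-bit halves of the 32-bit product.
    """
    multiplicand = to_signed16(mcd)
    digits = booth_digits(mpr)
    product = 0
    for clock in range(4):
        # Two Booth digits cover four multiplier bits, i.e. one clock of a 16 x 4 array.
        nibble_value = digits[2 * clock] + 4 * digits[2 * clock + 1]
        product += (multiplicand * nibble_value) << (4 * clock)
    product &= 0xFFFFFFFF
    return product >> 16, product & 0xFFFF


if __name__ == "__main__":
    upper, lower = multiply_four_clocks(0xFFFE, 3)  # -2 times 3
    print(f"upper = {upper:04X}, lower = {lower:04X}")
```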

The program memory contains 1K of 48-bit words. The output of this memory is clocked
into a micro-instruction register, and the memory address is derived from the program control
logic. The latter is based on the AMD 2909 program sequencer chip, a simplified block diagram
of which appears in Fig. VIII-3. Three of these 4-bit chips are used, making it possible to address
4K of program memory even though only 1K of such memory is needed for the present
application. The 2909 controller is driven by a 2-bit control line which enables one to select
the next program address to be either the last address plus one, a jump address which comes
from the micro-instruction register, the latest address on the internal stack, or an interrupt
address determined by the I/O system. The jump logic which drives the control ports of the
2909 allows for unconditional jumps, conditional jumps depending on the status bits coming from
the CPE, and jumps to and returns from subroutines. Subroutines may be nested up to four deep
when interrupts are locked out, and three deep when they are active.
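
The next-address selection described above can be sketched as a small behavioral model. The
numeric encoding of the 2-bit control line, the class and parameter names, and the choice to
raise an error on stack overflow are all assumptions made for illustration; the text specifies
only the four address sources, the 12-bit address width implied by three 4-bit chips, and the
four-deep nesting limit.

```python
from enum import Enum

ADDRESS_MASK = 0xFFF  # three 4-bit sequencer chips give a 12-bit (4K) address
STACK_DEPTH = 4       # nesting limit quoted in the text


class NextSource(Enum):
    """The four next-address sources; the numeric codes are assumed."""
    INCREMENT = 0   # last address plus one
    JUMP = 1        # address field of the micro-instruction register
    STACK = 2       # latest address on the internal stack
    INTERRUPT = 3   # address determined by the I/O system


class Sequencer:
    def __init__(self):
        self.address = 0
        self.stack = []

    def step(self, source, jump_address=0, interrupt_address=0, push=False):
        """Select the next program address; optionally save a return address."""
        return_address = (self.address + 1) & ADDRESS_MASK
        if push:
            if len(self.stack) == STACK_DEPTH:
                raise OverflowError("return stack is full")
            self.stack.append(return_address)
        if source is NextSource.INCREMENT:
            self.address = return_address
        elif source is NextSource.JUMP:
            self.address = jump_address & ADDRESS_MASK
        elif source is NextSource.STACK:
            self.address = self.stack.pop()
        else:
            self.address = interrupt_address & ADDRESS_MASK
        return self.address
```

In this model a subroutine call is a JUMP with `push=True` and a return is a STACK selection, so
an interrupt that saves its own return address leaves three stack levels for subroutines, in
line with the nesting limits quoted above.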


Fig. VIII-4. LPCM micro-instruction word format.




The I/O system for the LPCM consists of two input channels (the A/D and S/P converters)
and two output channels (the D/A and P/S converters). The A/D - D/A channels run on a common
129.6-µsec clock that is derived from the 150-nsec system clock. The P/S and S/P converters
run on external modem clocks which must have the same nominal frequency (2400, 3000,
or 4800 Hz), but which may be asynchronous to one another. The I/O channels generate an interrupt
request whenever their associated clocks present a rising edge to the system. This request
causes the program control logic to produce a jump to one of three predetermined locations
in program memory at the first instance the system finds itself in a position to allow interrupts.
Several interrupts may have requests pending at one time; they are serviced in order of their
priorities which are P/S, S/P, and A/D - D/A. While a given interrupt is being serviced, all
others are locked out. Upon return from an interrupt service routine, the software releases
interrupt lockout thus enabling the honoring of further interrupt requests.
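
The priority and lockout rules above can be sketched as follows. The class and method names are
hypothetical, and the model ignores timing within a cycle. The constant at the top records the
cycle budget implied by the two clocks, taking the A/D - D/A clock period as 129.6 µsec (the
sampling interval quoted for the speech samples): 129.6 µsec / 150 nsec = 864 machine cycles per
sample.

```python
SYSTEM_CLOCK_NS = 150
SAMPLE_CLOCK_NS = 129_600  # 129.6 microseconds, taken as the A/D - D/A clock period
CYCLES_PER_SAMPLE = SAMPLE_CLOCK_NS // SYSTEM_CLOCK_NS  # machine cycles per speech sample

# Service order stated in the text, highest priority first.
PRIORITY = ("P/S", "S/P", "A/D-D/A")


class InterruptController:
    def __init__(self):
        self.pending = set()
        self.locked_out = False

    def request(self, channel):
        """Record a rising clock edge from one of the three interrupt sources."""
        if channel not in PRIORITY:
            raise ValueError(f"unknown channel: {channel}")
        self.pending.add(channel)

    def acknowledge(self):
        """Return the channel to service next, or None if nothing can be serviced."""
        if self.locked_out:
            return None
        for channel in PRIORITY:
            if channel in self.pending:
                self.pending.remove(channel)
                self.locked_out = True  # all other interrupts wait until release
                return channel
        return None

    def release_lockout(self):
        """Called by the service routine on return."""
        self.locked_out = False
```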

2.  Instruction  Format 

The format of the 48-bit-wide instruction word is shown in Fig. VIII-4. The instruction
word  is  divided  into  various  fields  of  varying  length,  the  functions  of  which  will  now  be  discussed. 

The C_n, I_S, and I_F fields determine the basic operation that the CPE is to perform, e.g.,
add the contents of the internal register at address A to the contents of the internal register at
address B, or take the external data presented to the chip and logically AND them with the contents
of the internal register at address A. A list of useful combinations of these fields, along with a
mnemonic for each, is given in Appendix A at the end of this Section.

The I_D field determines where on the CPE chip the output of the ALU is to go. Some examples
are: the output of the CPE alone, the output of the CPE and internal register file at address B,
or the output of the CPE and the Q-register.

The  IC  and  CXI  fields  determine  where  the  CPE  gets  its  input  and  where  its  output  is  to  go, 
respectively.  The  IC  field  steers  the  input  6-way  multiplexer  to  any  of  the  input  sources  men- 
tioned above,  and  the  CXT  field  determines  which,  if  any,  of  the  output  registers  connected  to 
the  CPE  are  to  be  clocked.  The  A and  B fields  simply  supply  the  addresses  to  the  CPE's  two- 
port  memory  and  need  no  further  discussion. 

The  JPC  field  along  with  the  H and  S fields  provides  program  control  by  means  of  various 
kinds  of  jumps.  A complete  list  of  these  appears  in  Appendix  A.  Conditional  jumps  in  the  LPCM 
are  somewhat  unconventional  in  that  the  condition  on  which  the  jump  is  to  be  based  must  be  es- 
tablished in  an  instruction  preceding  the  actual  jump  instruction  by  means  of  the  TST  field. 

More precisely, if one wishes to conditionally jump, say, based on whether one of the CPE's
internal registers is zero, then the contents of this register must be made to appear at the CPE
output with an instruction that also has the TST bit set. This strobes the CPE status into a
(2-bit) status register which, in turn, may be tested by a subsequent instruction containing the
appropriate jump code.
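
The two-step conditional jump can be illustrated with a short sketch. The text says only that
the status register is 2 bits wide; the choice of a zero flag and a sign flag below is an
assumption, as are all the names.

```python
class StatusRegister:
    """Two latched status bits; which two bits the hardware keeps is assumed here."""

    def __init__(self):
        self.zero = False
        self.negative = False

    def strobe(self, cpe_output):
        word = cpe_output & 0xFFFF
        self.zero = word == 0
        self.negative = bool(word & 0x8000)


def execute(status, cpe_output, tst=False):
    """Run one instruction; the status is latched only when the TST bit is set."""
    if tst:
        status.strobe(cpe_output)


def jump_if_zero(status, program_counter, target):
    """A later instruction tests the latched status, not the current CPE output."""
    return target if status.zero else program_counter + 1
```

Because the status is latched rather than live, instructions that do not set TST can be placed
between the test and the jump without disturbing the condition.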

The remaining fields are quite straightforward. The F field appears directly at the CPE
input where it can be used for a constant or a base address. This field also contains the jump
address and must be set accordingly for each instruction containing a jump. The SIL and RIL
fields are used to set interrupt lockout and release interrupt lockout, respectively, and are
primarily used to prevent interrupts while executing calculations that an interrupt could destroy,
such as an ongoing multiply. The SCY and ECY fields are provided to facilitate multiple-
precision adds and subtracts. When the SCY bit is set during an add or subtract instruction,
the carry resulting from this operation is saved in a flip-flop. This saved carry can then be
used in a later add or subtract instruction by setting the ECY bit during that instruction. Finally,
the HLT bit stops the machine, a feature that is used only during debugging operations.
The two bits labeled U are unused.
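
A double-precision add built from two 16-bit adds shows how the saved carry is meant to be used.
This is a behavioral sketch with hypothetical names: the first add saves its carry (SCY set) and
the second consumes it (ECY set).

```python
class CarryFlipFlop:
    def __init__(self):
        self.saved = 0


def add16(a, b, flip_flop, scy=False, ecy=False):
    """16-bit add; scy saves the carry-out, ecy feeds the saved carry back in."""
    carry_in = flip_flop.saved if ecy else 0
    total = (a & 0xFFFF) + (b & 0xFFFF) + carry_in
    if scy:
        flip_flop.saved = total >> 16
    return total & 0xFFFF


def add32(a, b):
    """32-bit add (modulo 2**32) from two 16-bit adds linked by the saved carry."""
    flip_flop = CarryFlipFlop()
    low = add16(a, b, flip_flop, scy=True)               # low words, save carry
    high = add16(a >> 16, b >> 16, flip_flop, ecy=True)  # high words, use carry
    return (high << 16) | low
```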

3. Data-Memory Addressing

Addresses for the LPCM data memory must be generated in the CPE and then deposited in
the MAR. Direct addressing of data memory is achieved by having the desired address in the
F field of the micro-instruction word and passing it through the CPE to the MAR. Indexed addressing
can be accomplished by having a base address in the F field, adding to it the contents
of a CPE internal register, and depositing the result in the MAR. It should be noted, however,
that the contents of the addressed location in data memory are only available as a CPE input one
instruction cycle after the desired address is placed in the MAR. This is due to the fact that the
memory output is buffered in the MOR. Writing data memory is also a 2-step process in the
sense that the address must first be calculated and deposited in MAR before the datum itself
may be read out into the MBR.
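
The one-cycle read latency and the two-step write can be sketched as a small model. The class
and method names are hypothetical, and the model treats all 2K words as writable, whereas in the
machine 1.5K of them are ROM.

```python
MEMORY_WORDS = 2048              # 2K 16-bit words
ADDRESS_MASK = MEMORY_WORDS - 1  # 11-bit address, matching the 11-bit F field


def indexed_address(f_field, index_register):
    """Base address from the F field plus the contents of a CPE register."""
    return (f_field + index_register) & ADDRESS_MASK


class DataMemory:
    def __init__(self):
        self.words = [0] * MEMORY_WORDS
        self.mar = 0

    def cycle(self, load_mar=None):
        """Run one instruction cycle and return the MOR value visible during it.

        The visible word belongs to the address deposited in an earlier cycle;
        an address loaded now takes effect on the following cycle.
        """
        visible = self.words[self.mar]
        if load_mar is not None:
            self.mar = load_mar & ADDRESS_MASK
        return visible

    def store(self, datum):
        """Second step of a write: the datum lands at the address already in the MAR."""
        self.words[self.mar] = datum & 0xFFFF
```

A table read therefore takes two instructions: one that computes the address into the MAR, and
a later one that uses the MOR as its input.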


4.  Timing  Considerations 

The basic events that must take place in order to execute an LPCM instruction are:

(a)  Program  counter  assumes  desired  state 

(b)  Program  memory  is  accessed 

(c)  Accessed  instruction  is  executed  by  CPE. 

It  is  not  possible  to  perform  all  three  of  these  operations  in  the  desired  cycle  time  of  150  nsec, 
so  the  sequence  is  broken  into  two  parts  by  inserting  the  microprogram  instruction  register 
after  the  program  memory.  This  results  in  what  is  called  a doubly  overlapped  pipeline  struc- 
ture in  which  instruction  fetch  takes  place  in  parallel  with  execution  of  the  instruction  fetched  on 
the  previous  machine  cycle.  This  type  of  pipelining  is  transparent  to  the  programmer  of  the 
LPCM. 

The LPCM also employs pipelining in the data memory acquisition path and in the jump control
path as has been described earlier. This pipelining is not transparent to the programmer
in that memory addresses and jump conditions must be set up sufficiently in advance of the instruction
that makes use of them. Experience has shown that careful programming can usually
circumvent any potential loss of program efficiency caused by these pipelined paths in the
machine.

C.  ENGINEERING  CONSIDERATIONS 

The present LPCM is a prototype designed to demonstrate that a dedicated linear predictive
vocoder can be realized both cheaply and compactly using off-the-shelf components. Since it is
a prototype, it was decided to use standard 16- × 7-in. universal wirewrap boards as the packaging
medium rather than go directly to smaller PC boards. Universal boards were chosen because
the LPCM uses every standard package size from 14- to 40-pin in its design. The final
design uses 162 DIPs and occupies 1.5 boards. These figures include all the analog circuits
required before and after the A/D and D/A converters. The power consumption of the device
is less than 45 W. A photograph of the completed LPCM appears in Fig. VIII-5.



Fig. VIII-5. The completed LPCM.

Appendix B at the end of this Section gives a complete compilation of the parts used to fabricate
the LPCM. Included in the table are military and commercial cost figures for building 1,
500, 1000, and 10,000 processors. These figures are based on the extrapolation rules provided
by the Narrow Band Voice Consortium Subcommittee for estimation of "cost to produce." The
figures referring to the packaging of the LPCM are estimates of how it could be packaged using
PC boards, and do not reflect the present wirewrap packaging of the prototype.

D. DEBUGGING AND TEST SYSTEM

1. Hardware and Software Debugging Aids

The LPCM is intended to be a stand-alone device with its control program residing in
PROMs. During the debugging phase, however, it is necessary to replace the PROM with RAM
in order to facilitate program changes and allow the running of diagnostic programs. In addition,
it is extremely advantageous to have a means for starting and stopping the machine, setting
breakpoints, and examining the contents of data memory and the CPE's internal register file.

The above requirements were met by the design and fabrication of a separate unit, the
LPCM tester, which is connected to the LPCM by means of cables during the debugging phase.
The main component of the tester is a 1024 × 48 RAM which effectively replaces the PROM destined
to reside in the LPCM. In addition, the tester duplicates the AMD 2909 program control chips
that are located in the LPCM itself. This was done to minimize both the number of control cables
between the LPCM and its tester and the tester-oriented logic needed in the LPCM.

The tester's program memory can be loaded in either of two ways: (a) one register at a
time by means of front-panel switches, or (b) the entire memory can be loaded from a host computer.
The first mode is useful for toggling in small test programs and for patching larger programs.
The latter mode is used for loading large programs such as the diagnostic system or
the LPC vocoder program itself. When the tester is connected to the LPCM, the following control
functions are available:

(a) Start program at an arbitrary address

(b) Stop program

(c) Single-step program

(d) Stop at breakpoint determined by switches

(e) Inspect any location in data memory

(f) Inspect any location in CPE register file

(g) Inspect/change any location in program memory.

In addition to the above-mentioned hardware debugging aids, an extensive software diagnostic
system was written for the LPCM. This system tests the following functions of the LPCM:

(a)  RAM  portion  of  data  memory 

(b)  CPE  functions 

(c)  Jump  logic 

(d)  Multiplier 

(e)  I/O. 

2.  The  LPCM  Simulator  and  Assembler 

A simulator for the LPCM was written on a Univac 1219 computer so that software debugging
could take place in parallel with the fabrication of the LPCM hardware. The simulator accepts
as its input the binary code generated by an LPCM assembler. This assembler was also written
on the Univac 1219 and is a straightforward two-pass assembler that understands LPCM mnemonics
and symbolic addresses. Symbolic code is generated using the Univac's editor and then
fed to the assembler which produces a binary output that can be loaded into the LPCM or operated
on by the simulator. This same binary output was later used to burn in the PROMs that comprise
the LPCM's program memory.

The simulator is fairly sophisticated in that it simulates all I/O operations, including interrupts.
This allowed the debugging of not only the diagnostic package but the entire LPC vocoder
program itself. In the final stages of the vocoder programming, real speech was used as the input
to the simulator and the synthetic-speech output of the program was stored on magnetic tape.
All computation was done in non-real time, but the final output tape was then played back in real
time to provide convincing evidence that the LPCM vocoder algorithm was functioning correctly.
This indeed proved to be the case, because only a few additional program bugs were found when
the program was finally running on the LPCM itself.

E.  FIRMWARE  CONSIDERATIONS 

1.  The  LPC  Algorithm 

LPC was first described by Atal and Hanauer in 1971.^ Since then, many variations on this
algorithm have appeared in the literature (see bibliographies in Refs. 2 and 6). We have chosen
to implement the Markel form of the LPC algorithm for reasons detailed in Ref. 7.

This algorithm is described in block-diagram form in Fig. VIII-6. Speech samples taken
every 129.6 µsec are divided into 198-point nonoverlapping groups corresponding to approximately
20 msec of data. These groups are multiplied by a Hamming window and then used to form
P + 1 autocorrelation coefficients $R_0, \ldots, R_P$. The parameter P is the order of the filter used
to model the vocal tract, and ranges from 10 at 2400 bps to 12 at 3600 and 4800 bps.

The autocorrelation coefficients are used as the constants in a set of linear equations that
must be solved to obtain the parameters of the vocal-tract filter. These equations are solved
by means of the Levinson recursion which yields a set of P reflection coefficients $K_0, \ldots, K_{P-1}$
and a residual energy E. These reflection coefficients will be used at the receiver to implement
the vocal-tract filter. The structure chosen for this filter is the acoustic-tube filter described
in detail in Ref. 2. The residual energy is used at the receiver to generate the amplitude of the
excitation for the acoustic tube.

Fig. VIII-6. The LPC vocoder algorithm.
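
The analysis steps above (Hamming window, autocorrelation, Levinson recursion) can be sketched
in floating point as follows. This is a textbook formulation rather than the fixed-point LPCM
code; the function names are hypothetical, and the sign convention for the reflection
coefficients (predictor polynomial 1 + a_1 z^-1 + ...) is an assumption that may differ from the
one used in the actual program.

```python
import math


def hamming(num_points):
    """Standard Hamming window of the given length."""
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_points - 1))
        for n in range(num_points)
    ]


def autocorrelation(frame, order):
    """Window one frame and return the P + 1 coefficients R_0 .. R_P."""
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    return [
        sum(windowed[n] * windowed[n + lag] for n in range(len(windowed) - lag))
        for lag in range(order + 1)
    ]


def levinson(r, order):
    """Levinson recursion: returns ([K_0 .. K_{P-1}], residual energy)."""
    a = [1.0] + [0.0] * order
    energy = r[0]
    reflection = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / energy
        reflection.append(k)
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        energy *= 1.0 - k * k  # prediction error shrinks at every stage
    return reflection, energy
```

Each stage multiplies the residual energy by 1 - K², so in exact arithmetic every reflection
coefficient has magnitude below one and the residual energy never exceeds R_0.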

In addition to the processing described above, the raw speech samples are fed to a pitch and
voicing detector which produces both a voiced-unvoiced decision and an estimate of pitch. The
particular algorithm used for this purpose is the Gold-Rabiner pitch detector which is described
in detail in Refs. 9 and 10.

The parameters produced as described above are next coded and formed into a serial bit
stream for transmission to the remote receiver. The receiver portion of the algorithm accepts
such a serial bit stream from the remote transmitter and unpacks it to form the code-book addresses
of the various parameters. These addresses are then decoded to obtain the actual values
of the parameters, which are then used to implement the acoustic-tube filter and its excitation.
The output of the filter is the final synthetic speech.

The coding of the parameters, except for pitch which is transmitted as is, is accomplished
by a logarithmic-search table-lookup routine. The residual energy is logarithmically coded to
[illegible] bits. The reflection coefficients are coded by means of truncated log-area ratios. Each
reflection coefficient is first clamped to an individually selected interval, transformed by the
log-area-ratio function log[(1 - K)/(1 + K)], and finally truncated to the desired number of bits.
The number of bits used for the individual K's is a function of the desired transmission rate.
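The clamp, transform, and truncate sequence for one reflection coefficient can be sketched as below. This is only an illustration under stated assumptions: the natural logarithm, the uniform quantizer over the log-area-ratio range, the mid-cell decoder, and the interval and bit-count arguments are all hypothetical; the LPCM itself uses ROM coding tables searched by a logarithmic-search routine.

```python
import math

def lar(k):
    # Log-area ratio of a reflection coefficient (log base is an assumption).
    return math.log((1.0 - k) / (1.0 + k))

def encode(k, k_lo, k_hi, bits):
    """Clamp k to [k_lo, k_hi], map to log-area ratio, truncate to `bits` bits."""
    k = min(max(k, k_lo), k_hi)
    g_lo, g_hi = lar(k_hi), lar(k_lo)          # lar() decreases as k increases
    levels = 1 << bits
    code = int((lar(k) - g_lo) / (g_hi - g_lo) * levels)   # truncation
    return min(code, levels - 1)

def decode(code, k_lo, k_hi, bits):
    """Inverse mapping: code word -> reconstructed reflection coefficient."""
    g_lo, g_hi = lar(k_hi), lar(k_lo)
    g = g_lo + (code + 0.5) * (g_hi - g_lo) / (1 << bits)  # centre of the cell
    t = math.exp(g)
    return (1.0 - t) / (1.0 + t)
```

Quantizing in the log-area-ratio domain spaces the levels more finely in K as |K| approaches 1, where the filter response is most sensitive to coefficient error.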

2. Implementation of the LPC Algorithm

The LPC program consists of four major pieces: a background program that handles all
the computation that need only be performed once per frame, and three interrupt-service routines
that handle the computations that must be done for each modem clock and each A-D/D-A clock.




The A-D/D-A interrupt-service routine uses the newly arrived speech sample to update the
current windowed correlation and the six elementary pitch detectors. In addition, the acoustic-
tube filter is updated to produce a new synthetic-speech sample for the D/A converter. This
approach eliminates the need for any substantial buffering of raw speech, thus reducing our data
memory requirements. The reflection coefficients for the acoustic tube are interpolated against
the coefficients for the next frame every 5 msec, and the amplitude is interpolated every time
a new pitch pulse is generated. No amplitude interpolation takes place during unvoiced frames.

The main task of the P/S converter interrupt-service routine is to pass the coded data
produced by the analyzer portion of the program to the transmit modem. This is accomplished by
loading the first code word into the P/S converter and then counting a number of interrupts equal
to the known number of bits in this word. Subsequent words are then loaded and the appropriate
number of interrupts counted after each. When a complete frame of code words has been
serialized in this fashion and passed to the transmit modem, the current correlation coefficients
are transferred to registers used by the background routine, the correlator is reset to start a
new correlation, and a flag is set to tell the background routine to start a new frame calculation
using the new correlation coefficients.

The S/P interrupt-service routine receives serial data from the receiver modem. It
deserializes this stream into the proper-length code words using an interrupt-counting technique
similar to the one used by the P/S converter. The code words are then used to access decoding
tables, thus producing the parameters eventually used by the acoustic-tube synthesizer. These
parameters are transferred to the buffer used by the acoustic tube when the S/P routine's
counters determine that it has received a complete frame of new data.
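The bit-counting serialization and deserialization described for the P/S and S/P routines can be sketched as a round trip over variable-length code words. The bit order (most significant bit first) and the example word lengths are assumptions for illustration; the source does not give the actual frame layout.

```python
def serialize(code_words, bit_widths):
    """Emit each code word as `width` bits, one bit per modem clock."""
    bits = []
    for word, width in zip(code_words, bit_widths):
        for shift in range(width - 1, -1, -1):   # MSB first (assumed)
            bits.append((word >> shift) & 1)
    return bits

def deserialize(bits, bit_widths):
    """Rebuild code words by counting off the known number of bits per word."""
    words, pos = [], 0
    for width in bit_widths:
        word = 0
        for _ in range(width):                   # one bit per interrupt
            word = (word << 1) | bits[pos]
            pos += 1
        words.append(word)
    return words
```

Both sides need only the table of word lengths for the selected rate, which is why the receiver must also know where a frame starts in the incoming stream.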

The deserialization procedure just described only makes sense if the S/P routine "knows"
where the first code word of a frame is in the incoming bit stream. The process of making this
determination is known as frame synchronization and is another task of the S/P routine. Frame
synchronization is established by having the transmitter transmit a known bit pattern in place of
the pitch word during unvoiced utterances. The pattern is chosen to correspond to an illegal
(too high) pitch so that the receiver can still make an unambiguous buzz/hiss decision. The
frame synchronization algorithm then consists simply of searching for this known pattern in the
serial bit stream as it arrives at the receiver. Synchronization (i.e., knowledge of the location
of the pitch word) is declared when, and only when, the known pattern has been found at the
same location in six consecutive frames. When this occurs, the S/P routine sets its bit and
word counters accordingly, thus establishing synchronization.
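A minimal sketch of the six-consecutive-frame rule follows. It works on a stored list of bits rather than on a live stream, and the frame length and pattern are hypothetical parameters, so it illustrates only the decision rule, not the interrupt-driven LPCM routine.

```python
def find_sync(bits, frame_len, pattern, required=6):
    """Return the bit offset (modulo frame_len) at which `pattern` appears in
    `required` consecutive frames, or None if no offset qualifies."""
    n = len(pattern)
    for offset in range(frame_len):
        run = 0
        pos = offset
        while pos + n <= len(bits):
            if bits[pos:pos + n] == pattern:
                run += 1
                if run == required:
                    return offset
            else:
                run = 0              # a miss breaks the run of consecutive hits
            pos += frame_len
    return None
```

Requiring several consecutive hits at the same position guards against declaring synchronization on a chance occurrence of the pattern inside ordinary parameter data.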

The final routine to be discussed is the background routine. The start of this routine is an
idle loop whose sole purpose is to continually check the status of the frame-ready flag that is set
by the P/S interrupt-service routine. As long as this flag is clear, the program remains in the
idle loop except for those times when an interrupt arrives and transfers control to the
appropriate service routine. When the flag is finally set, the program drops out of the idle loop and
begins its once-a-frame computations. The first of these is the final determination of pitch by
a routine that examines the status of the six elementary pitch detectors and produces a buzz/hiss
decision and an appropriate pitch. Next, the double-precision correlation coefficients are put
into a block-floating-point format based on R(0) and passed on to the Levinson recursion, which
produces the desired reflection coefficients and the residual energy. The latter is unnormalized
to remove the scale factor introduced by the block-floating-point routine, and then the parameters
are coded using the appropriate coding tables. The final code words are placed in a buffer where
the P/S routine can access them for shipment to the transmit modem. Control is then returned
to the idle loop. It should be emphasized that, while the background routine is calculating,
interrupts are active, which means that the background routine is only actually working in the
intervals when no interrupt-service routine is in progress.
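The block-floating-point step can be illustrated as follows: all correlation coefficients are scaled by one common power of two chosen so that R(0) fills a 16-bit signed word, and the exponent is kept so that the residual energy can be unnormalized afterwards. The exact scaling rule used in the LPCM is not given in the text, so the rule below (R(0) placed in the range 2^14 to 2^15 - 1) is an assumption.

```python
def block_normalize(r, word_bits=16):
    """Scale integer correlations by a common power of two based on R(0).

    Returns (mantissas, exponent) with value ~= mantissa * 2**exponent.
    """
    if r[0] <= 0:
        raise ValueError("R(0) must be positive")
    # Choose the exponent so that 2**(word_bits-2) <= R(0) mantissa < 2**(word_bits-1).
    exponent = r[0].bit_length() - (word_bits - 1)
    if exponent >= 0:
        mantissas = [v >> exponent for v in r]     # arithmetic shift down
    else:
        mantissas = [v << -exponent for v in r]    # shift up small values
    return mantissas, exponent

def unnormalize(value, exponent):
    # Remove the common scale factor, e.g. from the residual energy.
    return value * (2.0 ** exponent)
```

Since no correlation coefficient exceeds R(0) in magnitude, scaling on R(0) keeps every coefficient within the word while preserving as many significant bits as possible.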

One final routine should be mentioned, namely the initialization routine. This routine starts
at program address zero and is only entered on power-up or when the initialize pushbutton is
pressed. The main function of this routine is to clear data RAM, initialize the few RAM
registers that require it, and finally determine which rate vocoder is desired. The latter function
is accomplished by sensing a front-panel rate-control switch and then setting pointers to the
proper coding and decoding tables. In addition, if the rate selected is 2400 bps, the filter order
is changed from 12 to 10.

T. CONCLUSIONS

The design concessions that mark the LPCM as a special-purpose machine designed to be
a speech terminal are limited I/O capability and limited data and program memory. The I/O
bus only communicates with A/D - D/A, parallel-to-serial modem input, and serial-to-parallel
modem output. The LPCM data memory consists of 1536 locations of 16-bit ROM tables and
512 locations of 16-bit RAM words. The program memory consists of 1K by 48 bits of ROM, of
which less than 800 locations are used. A priori knowledge of the operating algorithms as well
as an operating simulator and diagnostics reduced the entire time from design to completion to
less than one year. The present package requires 162 DIPs including audio circuits, dissipates
less than 45 W, and occupies about 1/3 cubic foot. The operating code occupies the machine for
about 65 percent of real time.

As a prototype device, the LPCM has specifications that are not as tight as they might be. Given
the 65-percent utilization, the cycle time can be slowed to over 200 nsec and power dissipation
reduced by roughly 10 W. The volume can be reduced by as much as a factor of 3 if PC boards
are used and tighter packaging is designed.
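The "over 200 nsec" figure follows from simple arithmetic on numbers given in this report: the 150-nsec cycle time and 129.6-usec sample period listed in Appendix B, and the 65-percent utilization quoted above. The short sketch below works it through; the assumption that the cycle time can be stretched until utilization reaches 100 percent, with no allowance for margin, is ours.

```python
CYCLE_NS = 150.0          # LPCM cycle time (Appendix B)
SAMPLE_PERIOD_NS = 129.6e3  # A/D, D/A sample period of 129.6 usec (Appendix B)
UTILIZATION = 0.65        # fraction of real time the code occupies

def slowest_cycle_ns(cycle_ns, utilization):
    # Cycle time at which the same code would just fill real time.
    return cycle_ns / utilization

def cycles_per_sample(sample_period_ns, cycle_ns):
    # Micro-instruction cycles available in one sample period.
    return sample_period_ns / cycle_ns

max_cycle = slowest_cycle_ns(CYCLE_NS, UTILIZATION)
budget = cycles_per_sample(SAMPLE_PERIOD_NS, CYCLE_NS)
```

Here `max_cycle` comes to roughly 231 nsec, consistent with the statement that the cycle time can be slowed to over 200 nsec, and `budget` is 864 cycles per sample at the present clock.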

The overall package count of 162 various-sized DIPs includes the 7 packages of
AMD [CPE (4) and AMD sequencer (3)], about 40 packages of memory and memory-related
circuits (20 packages for multiplier, and the rest for I/O), bus multiplexing, timing, interrupt,
and branching. It is clear that in terms of power and size the device is not defined by the
microprocessor chips. The overall machine size is determined by all of the "glue logic" and memory
packages, which swamp out the microprocessor chips. In fact, the memory and memory-related
packages probably represent a lower bound on size and power, in the sense that everything else
may shrink considerably, but the current memory size and power are relatively static.

APPENDIX A: LPCM MNEMONICS

The following is a compilation of the bit assignments that must be made to the fields of the
LPCM micro-instruction word to achieve various functions. Each of these assignments is
preceded with a mnemonic that can be used when preparing code for the LPCM assembler. The
first group of these assignments are the so-called "op codes," which affect the C field and the
two I fields. The format of the presentation consists of a mnemonic followed by a 3-digit octal
number giving the values assigned to the C field and the two I fields, respectively, followed by a
brief description of the operation accomplished by the assignment. The result of the operation
appears at the internal ALU output port. The following notations are used in the descriptions:

R(A)   Contents of internal register addressed by the A field.
R(B)   Contents of internal register addressed by the B field.
Q      Contents of the Q-register.
D      Data at input port of the CPE.
•      Logical AND
|      Logical OR
⊕      Logical exclusive OR
¬      Logical complement

It should be noted that not all of the operations of which the CPE is capable are included in
the following list.


Mnemonic  Code  Operation
ADDAB     001   R(A) + R(B)
ADDDA     005   D + R(A)
ADDAB1    101   R(A) + R(B) + 1
ADDDA1    105   D + R(A) + 1
SUBBA     111   R(B) - R(A)
SUBAD     115   R(A) - D
SUBAB     121   R(A) - R(B)
SUBDA     125   D - R(A)
SUBBA1    011   R(B) - R(A) - 1
SUBAD1    015   R(A) - D - 1
SUBAB1    021   R(A) - R(B) - 1
SUBDA1    025   D - R(A) - 1
MOVB      033   R(B)
MOVA      034   R(A)
MOVD      037   D
INCB      103   R(B) + 1
INCA      104   R(A) + 1
INCD      107   D + 1
DECB      013   R(B) - 1
DECA      014   R(A) - 1
DECD      027   D - 1
CSB       123   -R(B)
CSA       124   -R(A)
CSD       117   -D
ANDAB     041   R(A) • R(B)
ANDDA     045   D • R(A)
ORAB      031   R(A) | R(B)
ORDA      035   D | R(A)
XORAB     060   R(A) ⊕ R(B)
XORDA     065   D ⊕ R(A)
CMPB      023   ¬R(B)
CMPA      024   ¬R(A)
CMPD      017   ¬D
CLR       142   0
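As an illustration of how an assembler might handle these entries, the sketch below splits a 3-digit octal op code into its three one-digit field values. The field names `c`, `i_hi`, and `i_lo` and the small lookup table are hypothetical; reading the digits as carry-in, ALU function, and ALU source select is consistent with the table entries but is an interpretation, not something the text states.

```python
# A few op codes copied from the table above (mnemonic -> 3-digit octal string).
OP_CODES = {
    "ADDAB": "001",
    "ADDAB1": "101",
    "SUBAD": "115",
    "INCB": "103",
    "CLR": "142",
}

def split_op_code(octal_text):
    """Split a 3-digit octal op code into its three one-digit field values.

    Returns (c, i_hi, i_lo); the names are illustrative only.
    """
    value = int(octal_text, 8)
    return (value >> 6) & 0o7, (value >> 3) & 0o7, value & 0o7

def fields_for(mnemonic):
    # Look up a mnemonic and return its field values.
    return split_op_code(OP_CODES[mnemonic])
```

For example, ADDAB and ADDAB1 differ only in the first digit, which matches the extra "+ 1" in the description of ADDAB1.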

The next set of assignments concerns the destination field, which determines where the
output of the ALU is to go. The format is mnemonic, 1-digit octal number, and description.
The notations F for ALU output and Y for CPE output are used in the descriptions.

Mnemonic  Code  Description
Q         0     F → Q, F → Y
Y         1     F → Y
RAY       2     F → R(B), R(A) → Y
R         3     F → R(B), F → Y
SDD       4     Double-precision down shift: [F, Q]/2 → [R(B), Q], F → Y
SD        5     F/2 → R(B), F → Y
SUD       6     Double-precision up shift: [F, Q]*2 → [R(B), Q], F → Y
SU        7     F*2 → R(B), F → Y

The next set of assignments concerns the IC field, which controls the input multiplexer to
the CPE. The format is mnemonic, 1-digit octal number, and description.

Mnemonic  Code  Description
SP        0     Serial-to-parallel converter
ADC       1     A/D converter
LP        2     Bits 0 to 15 of the product
UP        3     Bits 15 to 30 of the product
MOR       4     Memory output register
FD        5     11-bit instruction field

The clocking of the various registers connected to the output of the CPE is controlled by
the output control field OC. The format is the same as for the input control field.

Mnemonic  Code  Description
NIL       0     Clock nothing
MAR       1     Clock memory address register
MBR       2     Clock memory buffer register
MCD       3     Clock multiplicand register
DAC       4     Clock D/A converter buffer register
PS        5     Clock into P/S converter
MPR       6     Clock multiplier register and start multiply sequence

The final group of assignments concerns the jump control fields: JPC, S, and R. The
format is mnemonic, 3-digit octal number giving the assignment to the JPC, S, and R fields,
respectively, and a description.

Mnemonic  Code  Description
NIL       000   No jump
JP        100   Unconditional jump
JPZ       200   Jump if positive or zero
JZ        300   Jump if zero
JN        400   Jump if negative
JNZ       500   Jump if not zero
JSW       600   Jump if switch w on
JSV       700   Jump if switch v on
JPS       110   Unconditional jump to subroutine
JPZS      210   Jump to subroutine if positive or zero
JZS       310   Jump to subroutine if zero
JNS       410   Jump to subroutine if negative
JSWS      610   Jump to subroutine if switch w set
JSVS      710   Jump to subroutine if switch v set
SBR       101   Return from subroutine

APPENDIX B: LPCM SPECIFICATIONS

Cycle Time                150 nsec

Basic Logic Family        TTL using low-power Schottky TTL in AMD chips,
                          high-power Schottky where necessary in critical paths.

Program Memory (ROM)      1K x 48 bits     12 - MMI 6351 (1K x 4)

Data Memory (ROM)         1536 x 16 bits   4 - MMI 6351 (1K x 4)
                                           2 - FCLD 93448 (512 x 8)

Data Memory (Active)      512 x 16 bits    8 - FCLD 93442 (256 x 4)

Hardware Multiplier       One quarter of an array operating in 150-nsec
                          4 x 16 multiply  8 - AMD 25S05 (2 x 4)

Basic CPE                 4 - AMD 2901 (4-bit slice)

Microsequencer            3 - AMD 2909 (4-bit slice)

Audio Conditioning        12-bit A/D, D/A conversion at 129.6-usec samples.
                          Input filter: 8th-order elliptic filter, 52-dB stop-band
                          attenuation, 1.2-dB ripple, cutoff at 3596 Hz.
                          Output filter: 8th-order elliptic filter, 41-dB stop-band
                          attenuation, 0.2-dB ripple, cutoff at 3596 Hz.

Total DIP Count           162

Total Power Dissipation   [illegible]

Construction Technique    Two universal wirewrap boards (50 percent of second
                          board unused), 7 x 16 in.
                          Center plane voltage
                          Two outside planes ground



REFERENCES

1. B. S. Atal and S. L. Hanauer, J. Acoust. Soc. Am. 50, 637 (1971).

2. J. D. Markel and A. H. Gray, Jr., Linear Prediction of Speech
(Springer-Verlag, New York, 1976).

3. E. M. Hofstetter et al., "Vocoder Implementations on the Lincoln
Digital Voice Terminal," EASCON '75, Washington, D.C.,
29 September - 1 October 1975.

4. P. E. Blankenship, "LDVT: High Performance Mini-Computer
for Real-Time Speech Processing," EASCON '75, Washington, D.C.,
29 September - 1 October 1975.

5. P. E. Blankenship, "Preliminary Investigation of Digital Speech
Processor Hardware Implementations," Technical Note 1975-8,
Lincoln Laboratory, M.I.T. (5 February 1975), DDC AD-A007062/3.

6. J. D. Markel and J. J. Wolf, "Linear Prediction and the Spectral
Analysis of Speech," BBN Report No. 2304, Bolt, Beranek and
Newman Inc., Cambridge, Massachusetts (August 1972).

7. J. D. Markel and A. H. Gray, Jr., IEEE Trans. Acoust., Speech,
and Signal Processing ASSP-22, 124 (1974).

8. N. Wiener, Extrapolation, Interpolation and Smoothing of Stationary
Time Series (The Technology Press and J. Wiley and Sons,
New York, 1957), Appendix B.

9. B. Gold and L. R. Rabiner, J. Acoust. Soc. Am. 46, 442 (1969).

10. M. L. Malpass, "The Gold-Rabiner Pitch Detector in a Real-Time
Environment," EASCON '75, Washington, D.C., 29 September -
1 October 1975.


IX. CHARGE-TRANSFER-DEVICE IMPLEMENTATION OF CHANNEL VOCODERS

As a result of [illegible] comparisons between the U.K. Belgard 2.4-kbps channel vocoder and
2.4-kbps LPC systems, which indicated the Belgard equipment to be competitive in areas of
"robustness" and quality acceptance, a small simulation study was performed at Lincoln
Laboratory. The result of the study was a Belgard simulation running on a Lincoln Laboratory DVT.
Unfortunately, the channel-vocoder structure requires many digital filters, so that real-time
operation is just possible on a DVT, compared with the less-than-40-percent running time of
LPC using standard digital-filter-computation structures. Progress in charge-coupled (CCD)
or charge-transfer devices (CTD), however, offers the possibility of efficient channel-vocoder-
filter implementations. A small study was undertaken jointly with the Electronics Research
Laboratory of the University of California at Berkeley to (a) study the overall vocoder
configuration, (b) design, fabricate, and test a prototype full-wave rectifier-desampling filter,
(c) breadboard a discrete transversal bandpass vocoder filter, and (d) create a fully integrable
operational-amplifier design compatible with the on-chip transversal-filter environment.

The overall vocoder configuration is based on a Belgard structure using finite impulse
response (FIR) transversal filters. With this approach, the CTDs can implement most of the
filters, with the envelope detector in the vocoder analysis one of the unknown factors. The
second task of designing the full-wave rectifier-desampling filter was completed during the
3-month tasking, and the completed chip was delivered to Lincoln Laboratory for evaluation.
The chip is shown in Fig. IX-1.

Fig. IX-1. The rectifier-desampling filter chip. (Labeled regions: absolute-value input, delay line, filter output, output amplifier/buffer.)

The input circuit of the chip utilizes the differential capabilities of a "fill and spill" type
input (Ref. 1) in order to calculate the absolute value of the applied signal. This input uses a
combination of charge-coupled and bucket-brigade type transfer mechanisms to yield a structure
which requires only a single level of metalization. This avoids the problem of threshold shifts
which occur between different metalization levels and thus yields increased dynamic range for
the absolute-value circuit. In addition, the input structure was designed to automatically
incorporate a bias charge level, since the addition of a DC level to the input signal would degrade
the accuracy of the absolute value. Figure IX-2 shows the input signal (bottom trace), the
output of the absolute-value circuit (top trace), and the filter output (using absolute-value input)
after 30 stages of delay through the CTD filter (middle trace).




Since the output of the absolute-value circuit is charge, it is directly compatible with the
input of a CTD transversal filter. The filter was implemented using the split-electrode
approach to obtain an extremely narrow low-pass-filter characteristic. The optimal linear-
phase weighting coefficients (Ref. 2) were determined with the constraint that the weighting
coefficients be positive. All-positive weighting coefficients have two advantages: the tap-weight
accuracy could be increased by a factor of two, and an extremely simple on-chip output
circuit became possible. This circuit was composed of a reset switch and a source follower, and
used the sense-line capacitance to integrate the signal charge. In addition, one-phase clocking
was used, which further reduces off-chip complexity. The filter response, shown in Fig. IX-3,
has a peak sidelobe level 45 dB down from the filter passband response. Additional
measurements have shown the total harmonic distortion to be less than 1 percent. These results
indicate that the complicated output circuits usually used for CTD filters (Ref. 3) are not
necessarily required to obtain adequate performance.
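The signal path just described, full-wave rectification followed by a narrow low-pass transversal (FIR) filter with all-positive tap weights, can be sketched numerically as follows. The tap weights here are a normalized Hamming window, chosen only because every weight is positive; they are a stand-in for, not a reproduction of, the constrained optimal linear-phase design of Ref. 2, and the tap count is an arbitrary example value.

```python
import math

def positive_lowpass_taps(n):
    """All-positive, symmetric (linear-phase) low-pass taps with unit DC gain.

    A Hamming window is used as an illustrative stand-in design.
    """
    raw = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def envelope(signal, taps):
    """Full-wave rectify the signal, then smooth it with the transversal filter."""
    rectified = [abs(s) for s in signal]
    out = []
    for n in range(len(rectified)):
        acc = 0.0
        for j, w in enumerate(taps):
            if n - j >= 0:
                acc += w * rectified[n - j]   # tapped delay line, weighted sum
        out.append(acc)
    return out
```

For a steady sinusoid the output settles near the mean of the rectified waveform, which is the channel-envelope quantity a channel-vocoder analyzer needs.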

Preliminary results of task (d) concerning a fully integrable MOS operational amplifier
are encouraging. Using the MOS op-amp in a sampled-data recursive filter may prove to be
more efficient than the CTD transversal filter. This is a next logical area of study in the
quest for an efficient channel-vocoder realization.

If the filter realization is feasible, the remaining areas of work to implement the vocoder
are those involving known digital circuits technology for encoding, decoding, and pitch detection.


REFERENCES

1. C. R. Hewes, "A Self-Contained 800 Stage CCD Transversal Filter,"
Proc. CCD '75, San Diego, October 1975.

2. J. H. McClellan, T. W. Parks, and L. R. Rabiner, "A Computer
Program for Designing Optimum FIR Linear Phase Filters,"
IEEE Trans. Audio Electroacoust. AU-21, 506 (1973).

3. R. D. Baertsch et al., "The Design and Operation of Practical
Charge-Transfer Device Transversal Filters," IEEE Trans.
Electron Devices ED-23, 133 (1976).


This report describes the FY 76-7T Lincoln Laboratory effort under the DCA Speech
Evaluation contract. A separate report describes the effort undertaken on Systems
Implications of Packetized Speech.

As we discussed in the Overview Section (Sec. I), the FY 76-7T effort has been directed
toward "robustness" of narrowband speech terminals, improving vocoder interoperability, and
implementing a low-cost LPC terminal. The work toward "robust" voice coding reported here
has not yet had its impact on narrowband hardware devices, but we are within sight of that goal.
Hopefully, FY 77 efforts will incorporate this year's algorithms into next year's devices. The
medium-rate coding problem may yield to low-cost adaptive predictive devices, since the APC
approach yields excellent quality output speech at 16 kbps. The obstacle is hardware cost and
complexity. The tandem quality problem will probably remain difficult until we succeed at
improving the narrowband and wideband terminals independently. Our work on the low-cost LPC
terminal has been very successful. We have produced two microprocessor-based LPC vocoders
which can drive modems at 2.4, 3.6, and 4.8 kbps. These devices have elicited much interest
from other military agencies, private agencies, and the Lincoln Laboratory Communications
Division.

Our program in FY 77 continues work on narrowband speech algorithms aimed at better
modeling of the speech wave by adaptive techniques, smoothed estimates, and more accurate
filter models which include zeros. A concentrated conferencing effort in FY 77 will start with
an elaborate conferencing simulation facility capable of running twenty user conferences dialed
up by touch-tone control. This facility will allow us to simulate all promising conferencing
geometries and control strategies. Finally, we are launching a substantial effort to design
and evaluate bandwidth-efficient communication systems capable of voice and data transmission
using the packetized virtual circuit concept.


GLOSSARY

AC      Accumulator
ADPCM   Adaptive Differential Pulse Code Modulation
ALU     Arithmetic/Logic Unit
AMDF    Absolute Magnitude Difference Function
APC     Adaptive Predictive Coding
ARC     Adaptive Residual Coding

CCD     Charge-Coupled Device
CPE     Central Processing Element
CTD     Charge-Transfer Device
CVSD    Continuously Variable Slope Delta Modulation

DCA     Defense Communications Agency
DFT     Discrete Fourier Transform
DIP     Dual In-Line Package
DRT     Diagnostic Rhyme Test
DVT     Digital Voice Terminal

ECL     Emitter Coupled Logic
EFL     Emitter Follower Logic

FDP     Fast Digital Processor
FFT     Fast Fourier Transform
FIR     Finite Impulse Response

IC      Integrated Circuit
IFFT    Inverse Fast Fourier Transform

LDVT    Lincoln Digital Voice Terminal
LPC     Linear Predictive Coding
LPCM    Linear Predictive Coding Microprocessor
LSI     Large-Scale Integration

MMSE    Minimum Mean-Square Error
MSI     Medium-Scale Integration

PCM     Pulse Code Modulation

RAM     Random Access Memory
ROM     Read-Only Memory

SSB     Single Sideband
SSBSC   Single-Sideband Suppressed-Carrier Amplitude Modulation
SSI     Small-Scale Integration

TTL     Transistor-Transistor Logic

VLSI    Very Large-Scale Integration


UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

1. Report Number: ESD-TR-76-382
4. Title: Speech Evaluation
5. Type of Report: Annual Report
7. Author: Bernard Gold
8. Contract or Grant Number: F19628-76-C-0002
9. Performing Organization Name and Address: Lincoln Laboratory, M.I.T., P.O. Box 73, Lexington, MA 02173
10. Program Element, Project, Task Area and Work Unit Numbers: Program Element No. 33126K
11. Controlling Office Name and Address: Defense Communications Agency, 8th Street & So. Courthouse Road, Arlington, VA 22204
12. Report Date: 30 September 1976
13. Number of Pages: [illegible]
14. Monitoring Agency Name and Address: Electronic Systems Division, Hanscom AFB, Bedford, MA 01731
15. Security Class. (of this report): Unclassified
16. Distribution Statement (of this Report): Approved for public release; distribution unlimited.

19. Key Words:
speech evaluation; speech compression; vocoder systems; speech-processing systems;
Lincoln Digital Voice Terminal; microprocessor chip sets; voice-excited systems;
pitch detection; hybrid packaging

20. Abstract:

This volume reports the work performed during FY 76-7T on the DCA Speech Evaluation Contract. Work
during this period on System Implications of Packetized Speech for DCA is reported under separate cover.
Three general areas of work are reported in this document: (1) work on narrowband terminal "robustness";
(2) work on wideband-narrowband tandeming; and (3) hardware speech-terminal efforts.

The robustness issues are defined early in this report; then, work on telephone-line simulation,
robust pitch extraction, and operation of LPC vocoders in acoustically noisy environments is reported.

This report also discusses some approaches and progress made in the improvement of wideband devices,
and the interoperability of wideband and narrowband terminals.

The design and development of a microprocessor-based LPC vocoder, as well as some work on the
development of charge-transfer-device-based channel-vocoder equipment, also are described.

DD FORM 1473 (1 JAN 73). EDITION OF 1 NOV 65 IS OBSOLETE.

UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


